Senior Site Reliability Engineer - Google Cloud Platform
Remote
Top Skills Details
•Cloud: Google Cloud Platform expertise; comfort with cloud-native architectures; open to candidates whose experience is on other cloud platforms.
•Resiliency & Chaos Engineering: Strong grasp of resiliency concepts; ability to design safe, hypothesis-driven chaos experiments and interpret outcomes to harden systems.
•Automation & CI/CD: Proven ability to improve CI/CD (policy gates, test automation, canary/blue-green); experience transitioning platforms (e.g., Jenkins → Harness).
•Load/Performance: Hands-on expertise with k6 and/or SmartBear load tools; capacity modeling; performance bottleneck analysis; test-in-pipeline practices.
•IaC & Platform: Terraform module design/standards; Helm chart authoring/ops for Kubernetes; config-as-code for Akamai where feasible.
•Observability: Deep experience setting up APM/logs/metrics (AppDynamics, Splunk), building actionable alerts, and designing dashboards around SLOs/SLIs.
•Programming: Proficiency in Python and JavaScript; familiarity with Kotlin and Groovy (especially in CI/CD pipelines).
Description
SRE Modernization & Reliability Engineering
• Lead SRE modernization aligned with DevOps principles: reliability by design, automation-first operations, and service ownership across build-run lifecycles (tool-agnostic mindset, strong principles).
• Define service level objectives/indicators (SLOs/SLIs) and error budgets; partner with product and engineering to balance feature velocity with reliability.
• Establish fault-tolerance baselines before production: codify and validate redundancy, graceful degradation, and recovery characteristics in pre-prod environments.
Chaos & Resiliency Engineering
• Build and run a structured chaos engineering program to continuously test resiliency in lower environments first, then in production with guardrails.
• Use Gremlin for experiment orchestration; define hypotheses, blast-radius controls, and success criteria; expand with vetted open-source tooling as appropriate.
• Translate findings into reliability backlogs and architectural improvements; drive blameless postmortems and preventive design patterns.
Observability & Alerting
• Mature end-to-end observability (app, infra, network, CDN) with actionable alerting: reduce noise, tighten signal, and ensure every alert is backed by a runbook.
• Implement and optimize AppDynamics (APM) and Splunk (logs, analytics) to deliver high-fidelity telemetry, business-level health indicators, and golden signals.
• Extend observability to our CDN (Akamai) for edge performance, cache health, and origin protection; integrate with runbooks and incident workflows.
Performance, Load, and Capacity
• Own load and performance testing strategy: why we test (resiliency goals), what we test (user journeys, critical paths), and how we test (shift-left, pipeline-driven).
• Operate and evolve tooling: k6, SmartBear (e.g., LoadNinja/ReadyAPI), and vetted third-party services; embed tests in CI/CD; feed results to capacity planning.
Deployment Automation & CI/CD
• Automate deployments end to end; enforce progressive delivery (canary and blue/green patterns) with automated rollback.
• Drive CI/CD process improvements and help migrate from Jenkins to Harness (under evaluation); standardize quality gates, policy-as-code, and reliability checks in pipelines.
Platform Engineering, IaC & Kubernetes
• Standardize Infrastructure as Code across clouds and platforms: Terraform modules, policy controls, and repeatable environments.
• Operationalize Helm charts for Kubernetes services, ensuring versioning, security baselines, and rollout strategies (canary/blue-green).
• Partner on Akamai configuration-as-code: codify edge policies, cache/CDN rules, and security controls; version and promote through environments.
Tooling Evaluation & Gap Closing
• Continuously evaluate tools and identify gaps across reliability, observability, chaos, and performance; build the roadmap to mature our environment and close those gaps.
Incident Response & Operations Excellence
• Participate in and help optimize the on-call rotation; reduce MTTA/MTTR through better detection, automation, and runbooks.
• Run blameless postmortems; convert systemic issues into durable engineering fixes and platform improvements.
Our Environment (What You’ll Work With)
• Cloud: Google Cloud (Google Cloud Platform) in a more cloud-native posture, including private connectivity patterns and VPC-scoped services.
• CDN/Edge: Akamai (multi-layer observability and config-as-code).
• Observability: AppDynamics (APM), Splunk (logs/analytics), with alert standards and runbooks.
• Chaos/Resiliency: Gremlin, plus curated open-source tools where they add value.
• Performance/Load: k6, SmartBear, and select third-party load services.
• CI/CD: Jenkins → Harness migration (in evaluation); progressive delivery patterns and automated rollbacks.
• IaC/Containers: Terraform, Helm (Kubernetes).
• Languages: Python, JavaScript; some Kotlin and Groovy in CI/CD contexts.
External Communities Job Description
Here's your chance to work for a leading global pizza company and contribute to several key initiatives for a Resiliency team.
Additional Skills Tags
cicd,jenkins,harness,splunk,Python,groovy,kotlin,javascript,Terraform,Kubernetes
Additional Skills & Qualifications
DevOps experience (the role is SRE first, not DevOps first)
Proactive mindset
Pre-prod and post-prod environment support