Role: Staff Site Reliability Engineer (SRE)
Location: San Francisco, CA (Hybrid)
Job Responsibilities
As our Staff SRE, you'll be the primary expert responsible for our entire compute ecosystem, operating at the highest level of technical expertise and influence. You won't just solve problems; you'll prevent them at a fundamental level across organizational boundaries. Your key responsibilities will include:
· Design, implement, and lead large-scale, cross-functional projects to improve the reliability, performance, and efficiency of our core services and infrastructure (10× impact).
· Drive the reduction of toil by developing and deploying sophisticated automation tools and frameworks, championing the "everything as code" philosophy.
· Serve as a technical escalation point for critical incidents, perform deep-dive root cause analyses (RCAs), and implement robust corrective measures to prevent recurrence.
· Define and implement SLOs, SLIs, and error budgets for critical services. Enhance our monitoring, logging, and tracing systems to provide comprehensive visibility into system health.
· Set the technical direction and best practices for the entire SRE and engineering organization. Mentor mid-level and senior engineers on design patterns, operational rigor, and reliability principles.
We're looking for a leader and a deep technical expert with a proven track record of solving the hardest scaling and reliability challenges.
Required Qualifications
· 8+ years of progressive experience in Site Reliability Engineering, Production Engineering, or a closely related role.
· Expert-level proficiency with AWS, including networking, compute, and storage.
· Deep expertise in Kubernetes and the cloud-native ecosystem.
· Fluency in at least one major scripting/programming language for automation and tooling (e.g., Python, Go, or Java).
· Solid experience with monitoring and logging solutions (Datadog).
· Proven ability to design and implement robust, highly available distributed systems.
· Demonstrated experience with Infrastructure as Code tools like Terraform.
· Exceptional communication skills, capable of explaining complex technical issues to both technical and non-technical audiences.
Nice-to-Have
· Experience implementing service mesh technologies (e.g., Istio, Linkerd).
· A strong understanding of security principles and practices in a cloud environment.
· Certifications such as CKA (Certified Kubernetes Administrator) or CKAD (Certified Kubernetes Application Developer).