Senior Site Reliability Engineer
This role combines deep operational expertise with hands-on engineering ability. Roughly 70% of your time will be spent owning the reliability, availability, scalability, and operational excellence of the cloud infrastructure and SaaS platforms that power our business. The remaining ~30% puts you directly in the platform engineering flow: building automation, improving deployment pipelines, and driving reliability initiatives from conception through production.
You will write and review automation code, contribute to architecture and deployment discussions, and collaborate closely with product engineering teams to ensure operational and reliability decisions are made correctly the first time.
Key Responsibilities
Reliability Engineering & Operations (~40% of role)
Own day-to-day monitoring, alerting, operational health, and on-call support for mission-critical SaaS platforms and cloud infrastructure.
Lead major incident response activities including escalation coordination, root cause analysis, and postmortem reviews.
Design and maintain high-availability, failover, backup, and disaster recovery procedures; validate RTO/RPO targets regularly.
Investigate and resolve production incidents end-to-end across infrastructure, platform, and application layers.
Automation & Platform Engineering (~30% of role)
Design, implement, and maintain Infrastructure as Code (IaC), deployment automation, and CI/CD pipeline improvements.
Develop tooling and automation to reduce operational toil and improve engineering productivity.
Partner with development teams to improve deployment safety, release reliability, and operational scalability.
Drive standardization of cloud infrastructure, operational engineering practices, and deployment governance.
Observability & Performance Optimization (~15% of role)
Build and maintain monitoring, logging, tracing, and alerting capabilities across distributed systems.
Establish service-level objectives (SLOs), service-level indicators (SLIs), and error budget policies.
Identify and remediate performance bottlenecks, scaling issues, and infrastructure inefficiencies.
Analyze operational telemetry and trends to improve reliability and capacity planning.
Security, Compliance & Architecture (~15% of role)
Implement operational security best practices including RBAC, least privilege access, and infrastructure hardening.
Ensure compliance with SOC 2, HIPAA, GDPR, and organizational security standards.
Participate in architecture reviews and operational readiness assessments for new services and platforms.
Mentor junior engineers on reliability engineering, cloud operations, automation, and incident management best practices.
Required Qualifications
7+ years of experience in Site Reliability Engineering, DevOps, Cloud Infrastructure, or Production Operations roles.
Strong experience operating workloads in cloud environments such as Microsoft Azure, AWS, or Google Cloud.
Hands-on experience with Kubernetes, Docker, CI/CD pipelines, and Infrastructure as Code tools.
Strong scripting and automation skills using Python, Bash, PowerShell, Go, or similar languages.
Experience with observability and monitoring platforms such as Datadog, Grafana, Prometheus, or Splunk.
Strong understanding of networking, Linux/Windows administration, distributed systems, and cloud-native architectures.
Experience with incident response, production troubleshooting, and operational governance.
Strong communication skills and ability to collaborate across engineering and business teams.
Preferred Qualifications
Experience supporting multi-tenant SaaS environments.
Experience with Terraform, Bicep, ARM templates, or Ansible.
Familiarity with GitOps and modern deployment strategies such as canary or blue/green deployments.
Experience working within regulated or compliance-driven environments.
Relevant cloud or Kubernetes certifications.