Job Title: Senior Site Reliability Engineer (SRE)
Location: Bay Area, CA (Onsite / Remote)
Duration: 6+ Month Contract
Job Description:
We are seeking a Senior SRE focused on Observability, Kubernetes, and Cloud Infrastructure to support our large-scale Google Cloud Platform/AWS/EKS platform. This role is central to improving SLO reliability, logging pipelines, distributed tracing, dashboards, and automated diagnostics across 10,000+ applications running in EKS.
Responsibilities:
Own the observability stack: Prometheus, Grafana, OpenTelemetry, Loki/ELK/Splunk, Jaeger, Alertmanager, SLO frameworks.
Build intelligent monitoring pipelines and ensure high reliability of metric ingestion, log ingestion, tracing, and analytics systems.
Develop Terraform modules for observability infrastructure, K8s components, cluster add-ons, and monitoring services.
Improve reliability of AWS/Google Cloud Platform/EKS clusters through automation, performance tuning, capacity modeling, and event-driven remediation.
Build AI-assisted diagnostics for anomaly detection, auto-alert tuning, automated playbooks, and noise reduction.
Partner with Platform Engineering to ensure reliable Istio/service mesh telemetry, API server health monitoring, and node-level insights.
Lead operational readiness, SLO reporting, incident management, and root cause analysis for platform outages.
Qualifications:
6-8 years in SRE, Infrastructure, or Kubernetes operations.
Strong knowledge of EKS/ECS/GKE, Kubernetes internals, and cluster operations.
Expertise in observability stacks (Prometheus, OTel, Grafana, ELK, Datadog, Splunk).
Advanced Terraform IaC and automation skills (Python/Go preferred).
Experience with CI/CD, cloud networking, service mesh (Istio), and capacity planning.