Job Details
Location: Atlanta, GA
Client Industry: Retail
Position Type: Full-time/Contract
About the Role
We are seeking a Platform SRE Lead to drive reliability, scalability, and performance across our retail client's digital ecosystem. As the SRE Lead, you will play a pivotal role in ensuring production resilience for mission-critical services such as e-commerce platforms, order management, payments, and digital customer experiences. The role focuses on platform-level observability, automation, and operational excellence, working closely with product engineering, infrastructure, and DevOps teams.
Key Responsibilities
Platform Reliability & Operations
Lead Platform SRE activities across Google Cloud Platform, Kubernetes, GCE, GCS, Pub/Sub, Dataflow, and Apigee ecosystems.
Ensure high availability (HA) and scalability of core retail applications, including checkout, payments, cart, search, and catalog services.
Establish and enforce SLIs, SLOs, and SLAs at platform and application levels.
Drive incident management and root cause analysis (RCA) for production issues.
Monitoring & Observability
Standardize dashboards in Dynatrace, Cloud Monitoring, and Cloud Logging for transactional and non-transactional workloads.
Define KPIs for edge (Akamai, Nginx), middleware (Apigee, GKE), and data services (Redis/Memorystore, BigQuery, Postgres/Mongo).
Build real-time alerting frameworks with noise reduction and actionable metrics.
Automation & Tooling
Design self-healing mechanisms and automation for production recovery.
Optimize CI/CD pipelines (GitLab, ArgoCD, Jenkins) for safe and fast deployments.
Implement infrastructure as code (Terraform/Helm) for consistent environment provisioning.
Leadership & Collaboration
Mentor and lead a global team of SREs, establishing best practices in reliability engineering.
Partner with Application SRE, DevOps, and Development teams to align priorities.
Act as the primary escalation point for critical platform issues in production.
Influence technology decisions around cloud-native adoption, cost optimization, and security.
Required Qualifications
10+ years of IT experience, including 5+ years in SRE/DevOps/Platform Engineering roles.
Proven experience in retail/e-commerce production environments.
Expertise in Google Cloud Platform (GCP) services: GKE, Pub/Sub, GCS, Dataflow, BigQuery, Apigee.
Strong skills in Kubernetes (scaling, troubleshooting, performance tuning).
Hands-on experience with Dynatrace, Prometheus, Grafana, and Cloud Monitoring for observability.
Strong coding/scripting ability in Python, Go, or Shell for automation.
Deep understanding of incident management, chaos engineering, and reliability patterns.
Experience with PCI, security compliance, and high-traffic retail systems.
Preferred Skills
Experience with GenAI/ML for observability and anomaly detection.
Familiarity with Akamai CDN, Nginx ingress controllers, and Redis/Memorystore.
Strong knowledge of CI/CD practices (GitLab, Jenkins, ArgoCD).
Excellent communication, leadership, and stakeholder management skills.
What We Offer
Opportunity to lead Platform SRE for a top retail client with a global presence.
Work on cutting-edge cloud-native and GenAI-powered reliability solutions.
A collaborative environment focused on innovation, automation, and resilience.