Hiring Now: Site Reliability Engineer (SRE) | Core SRE Only | Atlanta, GA | Onsite
Location: Atlanta, GA
Note: Seeking only Core SRE professionals. DevOps-only profiles will not be considered.
Mandatory: Hands-on experience with Dynatrace
About the Role:
We're looking for an experienced Site Reliability Engineer (SRE) to join our client's team in Atlanta, GA. This role requires a strong background in Core SRE, not traditional DevOps, with expertise in Dynatrace and a deep understanding of building scalable, reliable, and observable systems on AWS.
Key Responsibilities:
Reliability Strategy & Observability:
Design scalable, secure, and cost-effective infrastructure on AWS
Define and implement SRE best practices, SLIs/SLOs, and error budgets
Identify and close observability gaps using Dynatrace, OpenTelemetry, and similar tooling
Lead maturity improvements in monitoring and system health visibility
Platform Architecture & Automation:
Architect solutions that reduce operational toil through automation
Enhance CI/CD pipelines, IaC modules, and chaos engineering platforms
Research and recommend tools that improve reliability and efficiency
Technical Leadership:
Act as a technical advisor to development and platform teams
Ensure reliability principles are applied early in design ("shift-left")
Mentor engineers and lead production readiness assessments
Resilience & Incident Management:
Lead blameless postmortems and implement systemic improvements
Architect and enforce resilience patterns such as circuit breakers and graceful degradation
Must-Have Qualifications:
Proven experience as an SRE Architect or in a similar leadership role
Strong hands-on experience with Dynatrace, AWS, and observability tooling (e.g., Prometheus, Grafana, ELK, Jaeger)
Deep knowledge of SLIs, SLOs, automation, incident response, and postmortems
Expertise in Kubernetes, Docker, and scripting (Python, Go, Bash)
Excellent communication and stakeholder management skills
Nice-to-Haves:
Experience implementing chaos engineering tools and practices
Exposure to serverless platforms and modern reliability frameworks