Site Reliability Engineer

Overview

On Site
$50 - $60
Contract - W2
Contract - Independent
Contract - 6 Month(s)

Skills

Dynatrace
Site Reliability Engineer (SRE)
scalable
reliable
AWS
Prometheus
Grafana
ELK
Jaeger

Job Details

Hiring Now: Site Reliability Engineer (SRE) | Core SRE Only | Atlanta, GA | Onsite

Location: Atlanta, GA

Note: Seeking only Core SRE professionals. DevOps-only profiles will not be considered.

Mandatory: Hands-on experience with Dynatrace


About the Role:

We're looking for an experienced Site Reliability Engineer (SRE) to join our client's team in Atlanta, GA. This role requires a strong background in Core SRE, not traditional DevOps, with expertise in Dynatrace and a deep understanding of building scalable, reliable, and observable systems on AWS.


Key Responsibilities:

Reliability Strategy & Observability:

  • Design scalable, secure, and cost-effective infrastructure on AWS

  • Define and implement SRE best practices, SLIs/SLOs, and error budgets

  • Identify and close observability gaps using Dynatrace, OpenTelemetry, etc.

  • Lead maturity improvements in monitoring and system health visibility

Platform Architecture & Automation:

  • Architect solutions that reduce operational toil through automation

  • Enhance CI/CD pipelines, IaC modules, and chaos engineering platforms

  • Research and recommend tools that improve reliability and efficiency

Technical Leadership:

  • Act as a technical advisor to development and platform teams

  • Ensure reliability principles are applied early in design ("shift-left")

  • Mentor engineers and lead production readiness assessments

Resilience & Incident Management:

  • Lead blameless postmortems and implement systemic improvements

  • Architect and enforce resilience patterns such as circuit breakers and graceful degradation


Must-Have Qualifications:

  • Proven experience as an SRE Architect or similar leadership role

  • Strong hands-on experience with Dynatrace, AWS, and observability tooling (e.g., Prometheus, Grafana, ELK, Jaeger)

  • Deep knowledge of SLIs, SLOs, automation, incident response, and postmortems

  • Expertise in Kubernetes, Docker, and scripting (Python, Go, Bash)

  • Excellent communication and stakeholder management skills


Nice-to-Haves:

  • Experience implementing chaos engineering tools and practices

  • Exposure to serverless platforms and modern reliability frameworks
