Job Details
Location: Atlanta GA
Job Description:
- Define and implement the SRE architecture, reliability framework, and operational strategy.
- Design scalable, fault-tolerant systems for high availability and disaster recovery.
- Establish SLOs, SLIs, and SLAs across services and ensure compliance.
- Architect systems for observability: logging, tracing, metrics, and alerting.
- Drive automation for infrastructure, deployment, and monitoring using tools like:
  - Terraform, Ansible, Helm, Kubernetes Operators
  - CI/CD pipelines (Jenkins, GitHub Actions, GitLab CI, ArgoCD)
- Automate manual processes to improve efficiency and reduce MTTR.
- Develop self-healing mechanisms and automated remediation workflows.
- Lead cloud architecture design on AWS, Azure, or Google Cloud Platform.
- Architect and optimize Kubernetes clusters and containerized applications.
- Implement and manage scaling strategies, load balancing, and failover designs.
- Oversee network reliability, security, and configuration management.
- Implement observability tools like:
  - Prometheus, Grafana, ELK, Datadog, New Relic, Splunk
- Lead incident management processes by identifying root causes and improving system resilience.
- Conduct performance testing, capacity planning, and SLA compliance reporting.
- Partner with software engineering, DevOps, security, and product teams.
- Mentor the SRE team and promote reliability engineering best practices.
- Establish playbooks, runbooks, and operational documentation.
- 8-12+ years of experience in SRE, DevOps, or infrastructure engineering.
- Strong hands-on experience with:
  - Kubernetes & Docker
  - Cloud platforms (AWS/Azure/Google Cloud Platform)
  - IaC tools (Terraform/CloudFormation)
  - CI/CD systems
- Deep understanding of:
  - Distributed systems
  - Reliability and performance engineering
  - Observability tools
  - Incident & problem management
- Experience in scripting/programming (Python, Go, Shell, etc.).
- Strong troubleshooting, analytical, and architectural skills.