Job Title: DevOps Engineer
Location: Atlanta, GA; Birmingham, AL; Louisville, KY; Richmond, VA; Charlotte, NC
W2 only; no C2C.
Job Summary:
We are seeking an experienced Site Reliability Engineer (SRE) / DevOps Engineer with a strong background in Incident Management, Change Control, Error Budgeting, Remediation, and Production Operations. The ideal candidate will be responsible for ensuring the reliability, scalability, performance, and operational excellence of cloud-native platforms and distributed systems. This role requires deep expertise in cloud infrastructure, automation, observability, incident response, and operational governance.
Key Responsibilities:
- Manage and improve platform reliability, availability, and performance across production environments.
- Lead and participate in incident management, root cause analysis, remediation planning, and post-incident reviews.
- Drive change control processes and ensure operational governance standards are followed.
- Monitor and manage error budgets while implementing reliability improvements.
- Design, build, and maintain scalable cloud infrastructure and automation frameworks.
- Deploy and manage containerized applications using Kubernetes and Docker.
- Develop and maintain CI/CD pipelines to support efficient software delivery.
- Implement Infrastructure as Code (IaC) solutions for automated provisioning and configuration management.
- Establish observability strategies using monitoring, logging, and alerting platforms.
- Collaborate with development, infrastructure, security, and business teams to ensure platform stability.
- Troubleshoot complex production issues across cloud, networking, infrastructure, and application layers.
- Continuously improve operational processes, automation, and system resilience.
Required Skills:
- 7+ years of experience in Site Reliability Engineering (SRE), DevOps, Cloud Infrastructure, or Production Operations.
- Strong experience managing workloads in cloud environments:
  - Microsoft Azure
  - Amazon Web Services (AWS)
  - Google Cloud Platform (GCP)
- Hands-on experience with:
  - Kubernetes
  - Docker
  - CI/CD Pipelines
  - Infrastructure as Code (IaC)
- Strong scripting and automation expertise using:
  - Python
  - Bash
  - PowerShell
  - Go (Golang)
- Experience with observability and monitoring platforms:
  - Datadog
  - Grafana
  - Prometheus
  - Splunk
- Strong understanding of:
  - Networking concepts
  - Linux Administration
  - Windows Administration
  - Distributed Systems
  - Cloud-Native Architectures
- Experience with:
  - Incident Response
  - Production Troubleshooting
  - Operational Governance
Preferred Qualifications:
- Experience implementing reliability engineering best practices and SRE methodologies.
- Experience supporting large-scale enterprise production environments.
- Familiarity with high-availability and disaster recovery architectures.
- Experience automating operational workflows and infrastructure management.
- Knowledge of security best practices within cloud environments.
- Experience working in Agile and DevOps-driven organizations.
Mandatory Skills:
Site Reliability Engineering (SRE), Incident Management, Change Control, Error Budgeting, Production Remediation, Microsoft Azure, AWS, Google Cloud Platform, Kubernetes, Docker, CI/CD Pipelines, Infrastructure as Code (IaC), Python, Bash, PowerShell, Go (Golang), Datadog, Grafana, Prometheus, Splunk, Linux Administration, Windows Administration, Networking, Distributed Systems, Cloud-Native Architectures, Production Troubleshooting, Operational Governance
Best regards,
Tanuja P