Job Details
Position: Site Reliability Engineer (SRE)
Experience: 9+ years
About the Role
We are looking for a highly skilled Site Reliability Engineer (SRE) to join our team. The ideal candidate will bridge the gap between development and operations, ensuring our systems are scalable, reliable, and secure. You will be responsible for designing, automating, and monitoring critical infrastructure and for improving the performance of our services.
Key Responsibilities
Build, maintain, and scale cloud infrastructure (AWS/Azure/Google Cloud Platform) with high availability and resilience.
Implement automation and Infrastructure-as-Code (IaC) using tools like Terraform, Ansible, or CloudFormation.
Monitor system performance, availability, and reliability using Prometheus, Grafana, ELK, Splunk, or Datadog.
Develop CI/CD pipelines (Jenkins, GitHub Actions, Azure DevOps, GitLab CI).
Manage incident response, on-call rotations, root cause analysis (RCA), and postmortems.
Optimize system reliability, latency, and scalability across distributed systems.
Ensure security, compliance, and disaster recovery strategies are in place.
Collaborate with DevOps, development, and QA teams to ensure efficient release cycles.
Drive the definition and implementation of SLOs, SLIs, and SLAs to measure and improve service health.
Troubleshoot production issues across services and infrastructure.
Required Skills & Qualifications
Bachelor's degree in Computer Science, Engineering, or equivalent experience.
9+ years of experience in SRE, DevOps, or Cloud Infrastructure roles.
Strong expertise in Linux/Unix administration and in scripting or programming (Python, Go, Bash, or another shell).
Hands-on experience with Kubernetes, Docker, and microservices architectures.
Proficiency in cloud platforms (AWS, Azure, or Google Cloud Platform).
Experience with observability and monitoring tools (Prometheus, Grafana, ELK, New Relic, Datadog).
Familiarity with networking, DNS, load balancers, and CDN technologies.
Strong understanding of CI/CD pipelines and Git-based workflows.
Experience with incident management, chaos engineering, and resilience testing.