Job Description
Position: Site Reliability Engineer (SRE)
Role Summary
We are looking for a skilled Site Reliability Engineer (SRE) to ensure the reliability, availability, performance, and scalability of critical systems. The SRE will work closely with development and operations teams to build resilient infrastructure, automate operations, and improve system observability while meeting agreed SLAs and SLOs.
Key Responsibilities
• Design, build, and maintain highly available, scalable, and reliable systems.
• Define and manage SLIs, SLOs, and SLAs to ensure system reliability and performance.
• Automate infrastructure provisioning and configuration using Infrastructure as Code (Terraform, CloudFormation).
• Implement and manage CI/CD pipelines to enable safe and frequent deployments.
• Monitor system health using tools such as Prometheus, Grafana, Datadog, Splunk, and the ELK Stack.
• Participate in on-call rotations, handle incident response, and conduct root cause analysis (RCA) and post-mortems.
• Improve system resilience through capacity planning, load testing, and chaos engineering.
• Collaborate with engineering teams to improve application reliability and reduce operational toil.
• Manage cloud infrastructure on AWS, Azure, or Google Cloud Platform.
• Ensure that systems follow security, compliance, and operational best practices.
• Support production deployments, upgrades, and performance tuning.
Required Skills & Experience
• 3+ years of experience as an SRE / DevOps / Production Engineer.
• Strong knowledge of Linux/Unix systems and networking fundamentals.
• Proficiency in scripting or programming (Python, Go, Bash).
• Experience with containers and orchestration (Docker, Kubernetes).
• Hands-on experience with monitoring, logging, and alerting tools.
• Strong understanding of cloud platforms (AWS, Azure, or Google Cloud Platform).
• Experience implementing high availability, fault tolerance, and disaster recovery strategies.
• Excellent problem-solving and troubleshooting skills.