Job Details
As a Cloud Infrastructure Site Reliability Engineer (SRE) with expertise across multiple public cloud platforms, you will be responsible for managing and operating cloud infrastructure in alignment with the principles of Google's SRE model. Your role will focus on ensuring the reliability, availability, and performance of our cloud services, while driving automation and continuous improvement across production environments. You will collaborate closely with cross-functional teams to strengthen our cloud reliability posture and streamline operations through innovative automation solutions.
Key Responsibilities:
Design, build, and maintain highly available, scalable, and secure cloud infrastructure on platforms such as AWS, Google Cloud Platform, or Azure.
Develop and implement automation for provisioning, monitoring, scaling, and incident response using Infrastructure-as-Code tools (e.g., Terraform, CloudFormation, Ansible).
Monitor system reliability, capacity, and performance; proactively detect and address issues before they impact users.
Respond to production incidents, participate in on-call rotations, and lead post-incident reviews to drive root cause analysis and reliability improvements.
Collaborate with software engineering and security teams to ensure new services and features are production-ready and meet reliability standards.
Build and maintain tools for deployment, monitoring, and operations; automate manual processes to reduce toil.
Document operational processes and system architectures to ensure knowledge sharing and repeatability.
Continuously evaluate and implement new technologies to improve system reliability, security, and efficiency.