Job Description
We are looking for a highly experienced Google Cloud Platform Site Reliability Engineer (SRE) with 10+ years of overall IT experience and strong expertise in designing, automating, monitoring, and supporting cloud-native infrastructure on Google Cloud Platform. The ideal candidate will have deep hands-on experience with Kubernetes, Terraform, CI/CD pipelines, monitoring tools, and production support in highly scalable enterprise environments.
The candidate will work closely with DevOps, Development, Security, and Infrastructure teams to improve system reliability, automation, scalability, and operational excellence.
Required Skills & Experience
10+ years of IT experience with strong expertise in Cloud Infrastructure and Site Reliability Engineering
5+ years of hands-on experience with Google Cloud Platform
Strong experience with:
Google Kubernetes Engine (GKE)
Compute Engine
Cloud Load Balancing
Cloud Storage
BigQuery
Pub/Sub
Cloud Functions
IAM
Cloud Monitoring & Logging
Strong experience in Kubernetes and container orchestration
Expertise in Infrastructure as Code (IaC) using Terraform
Strong scripting/programming skills in Python or Bash/shell
Experience building and managing CI/CD pipelines using Jenkins, GitHub Actions, GitLab CI/CD, or Azure DevOps
Experience with monitoring and observability tools such as:
Prometheus
Grafana
ELK Stack
Cloud Operations Suite
Experience implementing SRE principles:
SLI/SLO/SLA
Incident Management
Root Cause Analysis (RCA)
High Availability
Disaster Recovery
Capacity Planning
Experience with configuration management tools like Ansible
Strong understanding of networking concepts (DNS, load balancing, VPN) and security best practices
Experience with GitOps, ArgoCD, or MLOps is a plus
Familiarity with Agile/Scrum methodologies
Excellent communication and troubleshooting skills
Responsibilities
Design, deploy, and maintain scalable and reliable cloud infrastructure on Google Cloud Platform
Manage and support Kubernetes/GKE clusters in production environments
Automate infrastructure provisioning and deployments using Terraform and CI/CD pipelines
Implement monitoring, logging, alerting, and observability solutions
Ensure high availability, scalability, reliability, and security of cloud platforms
Troubleshoot production incidents and perform root cause analysis
Optimize cloud infrastructure cost, performance, and resource utilization
Collaborate with development and DevOps teams to improve deployment reliability
Implement backup, disaster recovery, and failover strategies
Define and maintain SLOs, SLIs, and operational best practices
Participate in on-call support and incident response activities
Preferred Qualifications
Google Cloud Professional Cloud DevOps Engineer certification
Google Cloud Professional Cloud Architect certification
Kubernetes certifications (CKA/CKAD)
Experience with multi-cloud environments (AWS/Azure) is a plus
Keywords
Google Cloud Platform, SRE, Site Reliability Engineer, GKE, Kubernetes, Terraform, CI/CD, DevOps, Python, Cloud Monitoring, Prometheus, Grafana, Jenkins, GitHub Actions, Cloud Operations, IAM, Pub/Sub, BigQuery, Cloud Functions, Infrastructure Automation, Production Support, Observability, Incident Management, GitOps, ArgoCD, Ansible