Job Details
Google Cloud Platform Site Reliability Engineer
Location: Atlanta, GA
Key Responsibilities & Focus Areas:
Google Cloud Platform Expertise: This role requires deep knowledge of, and hands-on experience with, Google Cloud Platform services, including:
BigQuery
Cloud Logging
IAM (Identity and Access Management)
Service Accounts
Provisioning and monitoring cloud services (staging/production)
Deploying, maintaining, and troubleshooting cloud services
SRE Principles & Practice: A core component of this role is the practical application of Site Reliability Engineering principles:
SLIs (Service Level Indicators)
SLOs (Service Level Objectives)
Error Budgets
Toil Reduction
Automation
Incident Management
Postmortems
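For candidates less familiar with how the SLI, SLO, and error budget items above relate, the arithmetic can be sketched as follows. This is a minimal illustration, not part of the employer's tooling: the function name `error_budget`, its parameters, and the request counts are all hypothetical, and it assumes a simple request-based availability SLI.

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Summarize a request-based availability SLI against an SLO target.

    All names here are illustrative assumptions, not a real library API.
    slo_target is a fraction strictly between 0 and 1, e.g. 0.999 for "three nines".
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= failed_requests <= total_requests:
        raise ValueError("failed_requests must be between 0 and total_requests")

    # SLI: the fraction of requests that succeeded in the window.
    sli = (total_requests - failed_requests) / total_requests
    # Error budget: the number of failures the SLO tolerates in the window.
    allowed_failures = (1.0 - slo_target) * total_requests
    # Share of the budget already spent (values above 1.0 mean the SLO is breached).
    budget_consumed = failed_requests / allowed_failures

    return {
        "sli": sli,
        "allowed_failures": allowed_failures,
        "budget_consumed": budget_consumed,
        "slo_met": sli >= slo_target,
    }


if __name__ == "__main__":
    # Hypothetical window: 1,000,000 requests, 400 failures, 99.9% SLO.
    report = error_budget(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
    for key, value in report.items():
        print(f"{key}: {value}")
```

In this hypothetical window the SLO tolerates roughly 1,000 failed requests, so 400 failures spend about 40% of the budget. Teams commonly use the remaining share to decide whether to keep shipping changes or to prioritize reliability work.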
Architectural Design: Proven experience in designing reliable, scalable, and high-performing solutions is crucial.
Cloud Infrastructure & Technologies:
Comprehensive understanding of cloud computing platforms (Google Cloud Platform specifically), including infrastructure, networking, and security services.
Strong experience with containerization and orchestration (Docker, Kubernetes) and with serverless computing.
Observability: Designing and implementing robust observability solutions is a key skill, with experience in tools like:
Dynatrace
Prometheus
Grafana
ELK/EFK Stack (Elasticsearch, Logstash or Fluentd, Kibana)
Programming & Scripting: Strong skills in languages like Python, Go, and Bash are required for automation and tool development.
Problem-Solving & Leadership: The role demands excellent analytical, problem-solving, and strategic thinking skills, along with strong communication, collaboration, and leadership abilities to influence technical direction.
On-Call: Expect to be part of an on-call rotation.
Experience Required:
6+ years in systems engineering, platform support, DevOps, or site reliability roles.
Overall:
This is a senior-level SRE position requiring a strong blend of hands-on technical expertise in Google Cloud Platform, a deep understanding of SRE methodologies, and architectural design capabilities. The ideal candidate will be proficient in automation, observability, and incident response, with a commitment to building highly reliable and scalable systems.