We are seeking a highly skilled Site Reliability Engineer (SRE) with strong Golang development experience to improve the reliability, scalability, and performance of our production systems. This role combines software engineering, incident response, observability, and data analysis to build resilient platforms and automate operational excellence. You will develop tools and services that transform production incident data into actionable insights while driving reliability initiatives across cloud-native environments.
Key Responsibilities:
Develop and maintain reliability tooling and automation using Golang.
Participate in production incident response, troubleshooting, root cause analysis (RCA), and postmortem reviews.
Analyze incident and system performance data to identify trends and recommend reliability improvements.
Design and enhance observability solutions, including metrics, structured logging, distributed tracing, and alerting.
Build scalable automation to improve operational efficiency and reduce manual intervention.
Collaborate with software engineering teams to improve application reliability, performance, and resilience.
Manage and optimize Kubernetes-based production environments running on Google Cloud Platform (GCP).
Apply statistical techniques such as anomaly detection, regression analysis, and trend analysis to improve system health.
Communicate technical findings and business impact clearly to engineering and non-technical stakeholders.
Required Qualifications:
4+ years of experience in Site Reliability Engineering (SRE), DevOps, Platform Engineering, or Systems Engineering supporting large-scale production environments.
Strong hands-on experience with Golang (Go) is mandatory.
Strong SQL skills with experience performing data analysis on production and operational datasets.
Hands-on experience with Kubernetes and Google Cloud Platform (GCP).
Deep understanding of observability technologies, including monitoring, alerting, distributed tracing, structured logging, and metrics collection.
Experience with incident response, production support, root cause analysis, and reliability engineering best practices.
Knowledge of distributed systems, cloud-native architectures, and production operations.
Strong scripting and automation skills.
Excellent communication and documentation skills.
📩 Please share your updated resume at