Job Details
We are seeking a highly skilled Site Reliability Engineer (SRE) with strong observability expertise, proven communication skills, and the ability to drive reliability maturity across multi-team environments. This role is ideal for someone who can blend deep technical proficiency with strategic thinking and collaborative influence.
Key Responsibilities
Observability Engineering
• Design, scale, optimize, and manage Prometheus and Grafana environments.
• Write advanced PromQL queries and metric-based calculations, and build the dashboards and visualizations that use them.
• Build out and maintain Grafana instances, supporting multi-team use cases.
• Use Dynatrace metrics and analytics to deliver efficient, actionable observability solutions (e.g., dashboards, insights, reports) for engineering and operations teams.
• Analyze telemetry data to identify the metrics that matter (MTM), drive actionable insights, and influence engineering decisions.
Site Reliability Engineering
• Apply and evolve an SRE Maturity Model to help teams mature across observability, resilience, automation, and reliability.
• Establish, implement, and maintain Service Level Objectives (SLOs) and error budgets across applications and services.
• Partner effectively with engineering, product, operations, and leadership teams; translate complex technical insights into clear, actionable communication.
• Identify and reduce toil through automation, tooling improvements, and process refinement.
• Support incident analysis, reliability reviews, and continuous improvement initiatives.
Required Skills & Experience
• Familiarity with SRE principles, maturity models, and reliability roadmaps.
• Demonstrated experience improving application reliability via data-driven decisions.
• Hands-on experience with Prometheus, Grafana, and PromQL.
• Strong understanding of Dynatrace, metric analysis, and observability practices.
• Excellent communication skills and ability to collaborate across diverse technical and non-technical teams.
• Strong analytical and problem-solving skills with a bias for action.
Nice to Have
• Experience with Kubernetes, cloud platforms (AWS/Google Cloud Platform/Azure), or CI/CD pipelines.
• Experience with automation.
• Experience with large-scale distributed systems or high-availability architectures.