Skills:
1. Technical Expertise
Deep understanding of SRE principles, the SRE operating model, and DevOps methodologies.
Experience designing highly available, scalable, and resilient distributed systems.
Proficient in architectural design (Microservices, Cloud-native, Event-driven architecture).
Skilled in cloud platforms: Azure and Google Cloud Platform.
Strong knowledge of observability tools: UIM, Prometheus, Grafana, Datadog, New Relic, Splunk, AppDynamics.
2. Framework Design & Governance
Define and validate SLOs, SLIs, SLAs, error budgets, and availability targets.
Design runbooks, escalation policies, and chaos testing frameworks.
Create reusable templates for observability, alerting, and logging.
Ensure compliance and audit readiness.
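
For candidates less familiar with the terms above, the relationship between an SLO and its error budget can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the 30-day window, and the request-count model are hypothetical and are not taken from any framework used by the hiring team.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window.

    slo_target is a fraction such as 0.999 (hypothetical example value).
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def budget_remaining(slo_target: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent, based on event counts.

    A negative result means the budget for the window is overspent.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events <= 0 or not 0 <= good_events <= total_events:
        raise ValueError("need 0 <= good_events <= total_events and total_events > 0")
    allowed_bad = (1.0 - slo_target) * total_events  # failures the SLO tolerates
    actual_bad = total_events - good_events          # failures observed (the SLI side)
    return 1.0 - actual_bad / allowed_bad
```

Here the SLI is the measured ratio of good events to total events, the SLO is the target for that ratio, and the error budget is whatever the SLO leaves over. An SLA would typically be a looser, contractual version of the same target.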
3. Communication & Cross-Functional Leadership
Collaborate with architects, designers, and platform and infrastructure teams.
Document frameworks and lead adoption across teams.
Review designs and validate reliability criteria.
Roles & Responsibilities:
1. Framework & Standardization
Define and maintain the SRE operating model, framework, and onboarding guide.
Create templates and reference architectures for observability, alerting, and runbooks.
Standardize definitions of availability, reliability, latency, and performance.
2. Architectural Integration
Participate in application architecture reviews to validate SRE compliance.
Recommend design patterns for fault tolerance, failover, auto-scaling, and disaster recovery (DR).
Define observability-by-design principles.
3. Governance, Audit & Optimization
Establish and lead SRE councils or review boards.
Define SRE maturity models, scorecards, and compliance checks.
Perform SRE audits across product portfolios.
Guide teams on capacity modeling, load distribution, and cost-efficiency strategies.
Collaborate with platform teams on resource reservations and right-sizing.
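
As a rough picture of the capacity modeling and right-sizing work mentioned above, a first-pass sizing estimate might look like the sketch below. The function name, the 60% utilization headroom, and the two-instance floor are assumptions chosen for illustration, not figures from this role or any specific platform.

```python
import math


def instances_needed(
    peak_rps: float,
    per_instance_rps: float,
    target_utilization: float = 0.6,  # assumed headroom; tune per service
    min_instances: int = 2,           # assumed floor for redundancy
) -> int:
    """Estimate instance count so peak load stays under target utilization."""
    if peak_rps < 0 or per_instance_rps <= 0:
        raise ValueError("peak_rps must be >= 0 and per_instance_rps must be > 0")
    if not 0.0 < target_utilization <= 1.0:
        raise ValueError("target_utilization must be in (0, 1]")
    # Capacity each instance may use before it exceeds the utilization target.
    usable_rps = per_instance_rps * target_utilization
    return max(min_instances, math.ceil(peak_rps / usable_rps))
```

A real model would also account for failure-domain redundancy, burst behavior, and reserved versus on-demand pricing, but the same structure (measured peak, per-unit capacity, explicit headroom) usually carries over.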
4. Tool Rationalization & Strategy
Evaluate and recommend standard SRE toolchains for monitoring, logging, and tracing.
Own the integration strategy across observability platforms.
5. Training, Leadership & Evangelism
Conduct SRE bootcamps for application and infrastructure teams.
Champion a blameless culture and continuous improvement mindset.
Drive error budget policies and reliability trade-off discussions.
Mentor product teams on SRE integration strategies.
Influence architectural decisions with SRE perspectives.