Job Details
Required Skills & Experience
8-10 years in Incident Management, IT Operations, or SRE leadership.
Experience managing teams (Incident Analysts, SREs).
Strong knowledge of AWS, Kubernetes, CI/CD pipelines, and observability tools like Splunk, Prometheus, or Grafana.
Deep familiarity with ITIL Incident, Problem, and Request management processes.
Excellent crisis handling, communication, and stakeholder management skills.
Own the full lifecycle of incidents in non-production: detection, triage, resolution, and closure.
Be the escalation point when delivery teams run into problems.
Lead war rooms for major incidents, coordinating with DevOps, Infra, Security, and other teams.
Ensure incidents are escalated properly to all relevant teams.
Track and improve SLAs and metrics such as Mean Time to Detect (MTTD), Mean Time to Resolve (MTTR), and environment availability.