Job Summary:
We are seeking an experienced Observability and Outage Response Manager to lead the detection, response, and resolution of system outages and performance anomalies. This role is crucial to ensuring service reliability, minimizing downtime, and continuously improving incident management processes. You will also oversee defect tracking, anomaly detection, and the Root Cause Corrective Action (RCCA) process.
Key Responsibilities:
Oversee and manage response activities for outages, ensuring timely resolution and minimal impact on business operations.
Lead the Root Cause Corrective Action (RCCA) process to identify underlying causes and implement long-term fixes.
Communicate clearly and regularly with stakeholders during outage events, providing updates, coordinating remediation efforts, and setting expectations.
Develop, implement, and maintain robust observability frameworks to monitor system health and proactively detect anomalies.
Drive continuous improvement in outage and incident response strategies by collaborating with cross-functional engineering, QA, and support teams.
Conduct comprehensive post-incident reviews, document findings, and ensure lessons learned are integrated into existing processes.
Manage defect tracking and analysis workflows, ensuring defects are logged, prioritized, and resolved efficiently.
Train and mentor team members on outage management processes, observability best practices, and problem-solving techniques.
Evaluate and integrate observability tools, dashboards, and alerting mechanisms to enhance system visibility.
Qualifications:
Bachelor's degree in Computer Science, Information Technology, or a related field.
Proven experience in outage management, incident response, defect management, and anomaly detection.
Strong knowledge of observability tools and methodologies (e.g., Prometheus, Grafana, Splunk, Datadog, New Relic).
Demonstrated ability to conduct root cause analysis and implement corrective actions effectively.
Exceptional organizational, communication, and leadership skills.
Ability to manage multiple high-priority incidents under pressure.
Familiarity with ITIL or other formal incident and problem management frameworks is a strong plus.
Preferred Skills:
Experience in high-availability, distributed systems environments.
Hands-on knowledge of monitoring and alerting systems.
Understanding of SLAs, SLOs, and reliability metrics.
Background in site reliability engineering (SRE) is a plus.