Experience: 10+ years in IT Operations / NOC / Major Incident Management, including leadership roles.
Role Summary:
The Major Incident Management & NOC Lead is responsible for end-to-end command and control of the enterprise's 24x7 operational monitoring and incident response. This role leads the MIM and NOC function, drives Major Incident (P1/P2) execution, ensures rapid service restoration, and continuously improves operational maturity through problem management, automation, observability enhancements, and SLA governance.
This role requires a mix of strong incident leadership, technical depth across infrastructure and applications, and people/process management to ensure stability, availability, and performance across critical services.
Key Responsibilities:
A) Major Incident Management (Command & Control)
Own the Major Incident (P1/P2) process from detection to resolution, including war-room leadership, stakeholder updates, and closure.
Act as the Incident Commander and ensure structured triage, containment, workaround, and restoration.
Drive cross-functional coordination (App, Infra, Network, Security, DB, Cloud, Vendor teams) to reduce MTTR.
Ensure high-quality incident communications: executive summaries, impact analysis, ETAs, customer/business comms.
Lead and facilitate Post Incident Reviews (PIR/RCA); ensure actionable corrective/preventive actions (CAPA).
Identify recurring issues and trigger Problem Management with measurable reduction plans.
B) NOC Leadership & Operations
Lead the NOC team responsible for 24x7 monitoring, alert triage, event correlation, escalation, and ticket quality.
Establish/maintain standard operating procedures (SOPs), runbooks, escalation matrices, and on-call models.
Ensure NOC meets SLAs/OLAs, improves alert fidelity, and reduces noise through tuning and automation.
Manage handover governance between shifts; maintain service continuity and operational hygiene.
C) Service Reliability & Continuous Improvement
Drive operational improvements: monitoring coverage, SLO/SLA alignment, incident prevention, and resiliency initiatives.
Partner with Engineering/Platform teams on observability strategy, proactive detection, and reliability patterns.
Track and report operational metrics: MTTD, MTTR, incident volume, re-open rate, SLA compliance, and trends.
Support readiness for audits and compliance: evidence collection, process adherence, and risk mitigation.
D) Stakeholder & Vendor Management
Interface with business stakeholders, service owners, and leadership to provide incident status, risk, and remediation plans.
Manage vendor escalations and ensure timely resolution aligned to contractual SLAs.
E) Managerial / Leadership Skills (Must Have)
Proven experience leading MIM & NOC Operations teams (shift-based or on-call models).
Strong Incident Commander capability: calm under pressure, structured decision-making, priority trade-offs.
Excellent stakeholder management across technical teams and business leadership.
Ability to build and enforce process discipline (ITIL-aligned), while improving speed and quality.
Strong coaching/mentoring: performance management, skill development, hiring support as needed.
Effective communication: concise executive updates, clear action plans, facilitation of PIR/RCA sessions.
Data-driven mindset: uses metrics and trend analysis to drive operational outcomes.
Technical Skills (Must Have):
A) Monitoring / Observability
Hands-on experience with NOC tooling and observability platforms such as:
Splunk / ELK, Datadog, Dynatrace, New Relic, AppDynamics
Prometheus / Grafana, CloudWatch / Azure Monitor
Strong understanding of event correlation, alert tuning, noise reduction, and dashboarding.
B) Incident / ITSM Platforms
Strong working knowledge of ServiceNow (Incident, Problem, Change, Knowledge, CMDB) or equivalent ITSM tools.
Experience designing workflows, SLAs/OLAs, routing rules, and automation integrations.
C) Infrastructure & Platform Breadth
Solid understanding across:
Windows/Linux administration basics
Network fundamentals (DNS, DHCP, TCP/IP, routing, load balancers, firewalls)
Compute/virtualization (VMware/Hyper-V) and storage concepts
Database fundamentals (SQL/Oracle, replication, performance symptoms)
Cloud fundamentals and operational support for AWS/Azure/Google Cloud Platform:
IAM basics, networking (VPC/VNet), scaling, logging/monitoring, common failure patterns.
D) Automation & Scripting (Good to Have / Preferred)
Scripting knowledge: PowerShell / Python / Bash
Familiarity with automation tools: Ansible, Terraform, CI/CD operational workflows.
Ability to create/maintain runbook automation and self-healing patterns.
E) Security & Resilience (Preferred)
Awareness of security operations touchpoints: DDoS symptoms, certificate expiries, IAM issues, endpoint/EDR alerts.
Familiarity with BCP/DR processes, failover testing, and resilience design collaboration.
F) ITIL / Process Expectations
Strong ITIL understanding across Incident, Problem, Change, Knowledge, and Service Level Management.
Ability to implement governance around:
Change risk assessment, change windows, incident-change correlation
RCA quality, action item tracking, and effectiveness validation
Qualifications:
Bachelor's degree in Computer Science / IT / Engineering or equivalent experience.
ITIL 4 Foundation (preferred).
Cloud certifications (preferred): AWS/Azure fundamentals or associate level.
Experience in enterprise production environments with stringent availability requirements.
Success Metrics / KPIs:
Reduced MTTD and MTTR for P1/P2 incidents.
Improved SLA compliance and reduction in escalation breaches.
Reduced repeat incidents via problem management and preventive actions.
Improved alert quality: lower false positives, better signal-to-noise ratio.
Strong PIR/RCA compliance: on-time RCAs with measurable preventive outcomes.
Improved NOC operational maturity: SOP adherence, shift handover quality, audit readiness.
Nice-to-Have Industry Contexts:
Transportation / financial services / healthcare / e-commerce / SaaS environments with high availability targets.
Experience supporting microservices, Kubernetes, and distributed systems.