Incident & Request Manager

Overview

On Site

Accepts corp to corp applications

Contract - W2

Contract - Independent

Contract - Long Term

Skills

SRE

Incident Analysts

ITIL Incident

Job Details

Job Title: Incident & Request Manager - Non-Production Environments (Onsite)

Location: Atlanta, GA

Role Overview

The Incident & Request Manager leads the incident response and request management functions for all non-production environments (Dev, QA, UAT, Performance). Acting as the escalation point for project and product delivery teams, this role ensures incidents are resolved quickly, requests are fulfilled efficiently, and learnings are embedded into continuous improvement processes.

This role directly manages a team of Incident Analysts and Site Reliability Engineers (SREs), partners with DevOps teams to automate detection and response, and collaborates closely with Environment and Change Managers to reduce issue recurrence.

Key Responsibilities

Incident Management

Own the full incident lifecycle: detection, triage, response, resolution, and closure.
Act as the primary escalation point for project/product delivery teams during non-production environment (NPE) incidents.
Lead war rooms for critical incidents, coordinating with technical and delivery stakeholders.
Ensure timely escalation to Environment, Change, DevOps, Infrastructure, and Security teams as needed.
Track and improve incident performance against Service Level Agreements (SLAs), including Mean Time to Recovery (MTTR), Mean Time to Detect (MTTD), and availability Service Level Objectives (SLOs).

Request Management

Manage request fulfillment for project/product delivery teams (e.g., access, entitlements, environment service requests).
Standardize and automate common request types in collaboration with Intake and DevOps teams.
Ensure all requests are logged, prioritized, and fulfilled within defined SLAs.
Provide transparency and timely updates to stakeholders regarding request status.

Team Leadership

Manage and mentor Incident Analysts and SREs.
Ensure follow-the-sun coverage through offshore and onsite teams.
Foster a culture of blameless incident management, automation-first practices, and continuous learning.

Governance & Root Cause Analysis (RCA)

Ensure all incidents have documented Root Cause Analyses.
Track corrective and preventive actions, feeding outcomes into Change and Environment management processes.
Provide trend reporting and actionable insights to leadership.

SRE & DevOps Alignment

Collaborate with SREs and DevOps teams to automate incident detection, rollback, and recovery.
Integrate observability tools (e.g., Splunk, Prometheus, Grafana) into proactive monitoring frameworks.

Stakeholder Communication

Provide timely updates during incidents and whenever request fulfillment is delayed.
Publish regular reports on incident trends, RCA outcomes, and SLA adherence.
Maintain trust with project and product delivery teams through transparent communication.

Required Skills & Experience

8 to 10 years of experience in Incident Management, Service Operations, or SRE leadership roles.
Proven experience managing Incident Analysts and SRE teams.
Strong technical knowledge of AWS, Kubernetes, CI/CD pipelines, and observability tools such as Splunk, Prometheus, and Grafana.
Deep understanding of ITIL Incident, Problem, and Request Management frameworks.
Excellent crisis management, communication, and stakeholder engagement skills.

About Amaze Systems Inc