Location / Remote: Remote (must live within the continental U.S.); quarterly travel to Atlanta, GA required
Employment Type: Indefinite W-2 or 1099/IC contract (extended annually)
Compensation: Up to $55/hour W-2 or up to $62/hour 1099/IC (commensurate with experience)
Benefits: Medical, dental, vision, LTD/STD, HSA/FSA, term life, and optional supplemental insurance coverage available for W-2 employees (including family coverage if needed)
Job Summary:
We are seeking a Senior Problem & Incident Manager to serve as the central orchestration point for enterprise-wide incident response and problem management. This role is highly technical and process-driven, responsible for leading major incidents, driving structured root cause analysis, and strengthening operational governance across infrastructure, cloud, identity, and application environments.
Success in this role is measured by reduced repeat incidents, improved mean time to resolution (MTTR), stronger change controls, and clear executive-level communication during high-impact outages. This individual will operate as an Incident Commander during major events and will help formalize Problem Management processes in a growing, evolving IT environment.
Responsibilities:
- Lead enterprise-wide incident management for Severity 1-3 incidents across infrastructure, cloud, identity, security, and applications.
- Act as Incident Commander during major outages, coordinating cross-functional technical teams and vendors to restore services quickly and effectively.
- Conduct structured Root Cause Analysis (RCA) and post-incident reviews (PIRs) using formal methodologies (e.g., 5 Whys, Fishbone, Fault Tree).
- Own the Problem Management lifecycle, identifying recurring issues and driving permanent corrective actions.
- Partner with infrastructure, cloud, application, and security teams to reduce change-related incidents and improve production readiness.
- Participate in daily operational readouts and provide executive-ready communications regarding system health, impact analysis, and resolution status.
- Track and report on key operational metrics, including MTTR, MTTA, repeat incident rates, change-related incidents, and problem backlog aging.
- Support and mature ITSM processes, including incident, problem, and change management workflows.
- Help define and document SOPs, playbooks, escalation paths, and governance standards in a developing process environment.
- Ensure operational risks are identified, communicated, and mitigated in alignment with compliance and security expectations.
Required Skills & Experience:
- 8+ years of progressive experience in IT operations, infrastructure, cloud operations, or enterprise production environments.
- Proven experience leading major incident response efforts in enterprise environments.
- Hands-on experience conducting structured Root Cause Analysis and post-incident reviews.
- Strong technical background in enterprise infrastructure (networking, compute, storage, virtualization).
- Experience supporting Azure and/or AWS cloud environments.
- Strong familiarity with Microsoft 365 and identity platforms (Active Directory, Entra ID, MFA, SSO).
- Working knowledge of security tooling (SIEM, EDR, vulnerability management) and monitoring/observability platforms.
- Experience with ITSM platforms (e.g., EasyVista, ServiceNow, Jira Service Management, BMC, or similar).
- Strong written and verbal communication skills, with the ability to deliver executive-level incident communications.
- Demonstrated ability to operate independently, prioritize effectively, and manage high-pressure situations.
Preferred Qualifications:
- Experience building or maturing Problem Management processes from the ground up.
- ITIL certification or strong familiarity with ITIL concepts.
- Experience correlating change management events to incident trends.
- Background in regulated or compliance-driven industries (e.g., utilities, energy, finance).
- Experience working in environments with identity-heavy or access-controlled systems.