Program Manager, SRE Stability Initiatives & Content Strategy (Onsite)
Phoenix, AZ, US • Posted 1 day ago • Updated 1 day ago

Value Spectrum Technologies LLC
Job Details
Skills
- Stability Initiatives
- Program Management
- Site Reliability Engineering
- production stability
- enterprise infrastructure
- incident management
- SLIs
- SLOs
- SLAs
- error budgets
- major incident management (P1/P2)
- Root Cause Analysis (RCA)
Summary
We are seeking an experienced and highly motivated Program Manager to lead and drive Site Reliability Engineering (SRE) stability initiatives, operational excellence programs, and technical content strategy for a large-scale enterprise banking environment. This role requires strong expertise in managing reliability programs, improving system stability, driving incident reduction initiatives, and creating clear, high-quality technical and executive-level communications.
The ideal candidate will possess a strong understanding of SRE principles, production stability, enterprise infrastructure, incident management, and observability, combined with excellent content writing, documentation, and stakeholder communication skills. This individual will act as a bridge between engineering, operations, leadership, and business stakeholders.
Key Responsibilities
Program Management & Stability Initiatives
Lead enterprise-wide SRE stability programs focused on improving platform reliability, resiliency, and performance.
Define and track key reliability metrics such as SLIs, SLOs, SLAs, error budgets, availability, and MTTR.
Develop and execute stability roadmaps, reliability improvement initiatives, and risk mitigation strategies.
Coordinate cross-functional teams including SRE, DevOps, infrastructure, application engineering, and operations teams.
Identify reliability gaps and implement proactive measures to prevent incidents and outages.
Track and report program progress, risks, dependencies, and outcomes to leadership.
Incident Management & Operational Excellence
Lead and coordinate major incident management (P1/P2) processes.
Drive Root Cause Analysis (RCA) programs and ensure preventive actions are implemented.
Establish best practices for incident response, escalation, and communication.
Promote automation, observability, and operational efficiency initiatives.
Support continuous improvement through post-incident reviews and reliability engineering practices.
Content Writing & Communication
Develop high-quality technical documentation, executive reports, stability dashboards, runbooks, and operational playbooks.
Create clear and concise communications for both technical and non-technical stakeholders.
Prepare executive-level summaries, program updates, and reliability reports.
Document incident reports, RCAs, and stability improvement plans.
Maintain knowledge bases, operational procedures, and best practices documentation.
Stakeholder & Executive Management
Act as the primary point of contact between engineering teams and executive leadership.
Provide regular updates to senior leadership on stability metrics, risks, and improvement initiatives.
Facilitate executive reviews, operational reviews, and governance meetings.
Ensure alignment between engineering, operations, and business objectives.
Required Qualifications
10+ years of experience in Program Management, Technical Program Management, or SRE Program Management
Strong understanding of Site Reliability Engineering (SRE) principles and production operations
Experience managing enterprise stability, reliability, and incident management programs
Proven experience in banking, financial services, or highly regulated environments
Excellent technical content writing, documentation, and communication skills
Experience working with cross-functional technical and business teams
Strong leadership, coordination, and stakeholder management skills
Required Technical Knowledge
SRE concepts: SLI, SLO, SLA, Error Budgets, MTTR, Availability, Reliability Engineering
Incident Management and Root Cause Analysis processes
Observability and monitoring tools such as Dynatrace, Splunk, Datadog, New Relic, and Prometheus / Grafana
Cloud platforms: AWS, Azure, or Google Cloud
DevOps and infrastructure concepts: CI/CD pipelines, Kubernetes / containers, infrastructure automation
Key Skills
Program Management
Site Reliability Engineering (SRE)
Stability & Reliability Initiatives
Incident Management & RCA
Technical Content Writing
Executive Communication
Stakeholder Management
Banking Domain Knowledge
Observability & Monitoring
Operational Excellence
- Dice Id: 91165686
- Position Id: 8895625
Company Info
About Value Spectrum Technologies LLC
Step into a future defined by empowerment at Value Spectrum Technologies. With leading-edge software solutions and strategic consulting, we're dedicated to shaping and elevating your digital tomorrow. Experience the synergy of innovation and collaboration as we unlock unparalleled opportunities for growth in the dynamic landscape of technology. Welcome to empowerment.
Join us in navigating the ever-evolving digital landscape with confidence, as we work together to unlock unprecedented opportunities and build a tomorrow that is truly empowered by the limitless possibilities of technology. Your digital future starts here.

