Role Summary
- We are seeking a Senior Manager of Site Reliability Engineering (SRE) to help drive the activation, structure, and scaling of SRE practices across the Financial Services & Innovation (FS&I) organization.
- This role is responsible for establishing operational discipline, driving adoption of SRE standards, and aligning application teams, Production Support Engineering (PSE), and platform teams to a consistent reliability model.
- The ideal candidate brings a combination of technical depth, organizational leadership, and execution rigor, with proven experience implementing SRE practices in complex enterprise environments.
Key Responsibilities
SRE Activation & Operating Model
- Drive adoption of the SRE operating model across application teams
- Establish clarity in roles between:
  - Production Support Engineering (PSE)
- Ensure SRE practices are embedded into the development lifecycle, not treated as post-production activities
Reliability Standards & Governance
- SLIs, SLOs, and Error Budgets
- Production readiness criteria
- Reliability best practices
- Lead SLO adoption and compliance reviews across the organization
- Establish governance frameworks to ensure consistent application of standards
Cross-Team Coordination & Enablement
- Application product teams
- Production Support Engineering (MG team)
- Platform / Infrastructure / Observability teams
- Drive alignment and reduce friction between engineering and operations
- Ensure clear handoffs, escalation models, and operational ownership
Observability & Monitoring Strategy
- Lead adoption of centralized observability standards across:
- Align tooling (AppDynamics, Splunk, Prometheus, etc.)
- Ensure monitoring and alerting are SLO-driven and actionable rather than a source of noise
Incident Management & Continuous Improvement
- Partner with PSE to strengthen:
  - Incident management processes
  - RCA (Root Cause Analysis) standards
- Drive identification of patterns and systemic issues
- Ensure learnings translate into engineering improvements and automation
Automation & Reliability Engineering
- Identify opportunities to:
  - Reduce manual operational work
  - Improve system resilience
  - Enable self-healing capabilities
- Promote a culture of engineering over reaction
Reporting & Organizational Insight
- Define and track reliability metrics across FS&I
- Build reporting that provides visibility into:
- Translate technical data into actionable business insights
Required Qualifications
- 10+ years in engineering, operations, or SRE roles
- 5+ years leading SRE, platform, or reliability-focused teams
- Proven experience implementing SRE practices at scale (SLIs, SLOs, error budgets)
- Strong background in cloud environments (AWS, Azure, Google Cloud Platform)
- Hands-on experience with observability tools (Splunk, AppDynamics, Prometheus, etc.)
- Experience in incident management and production operations at scale
- Ability to operate effectively in high-pressure and complex enterprise environments
Preferred Qualifications
- Experience driving organizational transformation (not just technical implementation)
- Strong understanding of CI/CD, DevOps, and automation practices
- Experience working in regulated or large enterprise environments
- Familiarity with AIOps or advanced automation strategies
Key Success Indicators
- Increased adoption of SLOs and reliability standards
- Reduction in high-severity incidents over time
- Improved MTTR and operational efficiency
- Increased adoption of standardized observability practices
- Reduction in reactive, ticket-driven work across teams
- Clear alignment between SRE, PSE, and application teams
Core Competencies
- Strategic thinking with strong execution focus
- Ability to drive alignment across multiple teams and stakeholders
- Strong communication and influence skills
- Bias toward structure, clarity, and accountability
- Ability to operate with urgency and discipline in complex environments