***Position is bonus eligible***
Prestigious Financial Institution is currently seeking a Director, Enterprise Platform Governance and Strategy, with strong SRE leadership experience. The candidate will manage a team of engineering managers and senior technical staff across the Site Reliability Engineering, Cloud Architecture, and Metrics & Reporting functions. The leader in this role is accountable for scaling a mature SRE practice, driving cloud architectural standards and multi-year strategy, and ensuring the organization operates with clear, data-driven visibility into platform health and performance. A critical dimension of this role is ownership of the FinOps and SecOps domains as Product Manager, alongside governance of Platform Engineering (PE) compliance obligations spanning incidents, risks, and audit findings.
Responsibilities:
Site Reliability Engineering
- Lead the scaling and maturation of the SRE practice, establishing error budgets, SLOs, SLAs, and incident response frameworks across all platform services.
- Define and enforce reliability standards including on-call models, blameless postmortem processes, and corrective action tracking to drive continuous improvement.
- Partner with Platform Foundation teams (Kubernetes, Kafka, FinOps/Security) to embed reliability principles into build and operate models.
- Champion toil reduction through automation, ensuring engineering capacity is redirected from manual operations to higher-value platform capabilities.
Platform Engineering Governance & Compliance
- Serve as Product Manager for the FinOps and SecOps domains within Platform Engineering, owning the product vision, prioritization, and stakeholder alignment for governance tooling and practices.
- Establish and maintain a governance framework ensuring Platform Engineering adheres to organizational standards across incident and problem management, SORTs, risk tracking, and audit findings.
- Own the end-to-end process for PE compliance obligations, ensuring timely resolution and closure of incidents, problem tickets, risk items, and audit observations with clear accountability and tracking.
- Partner with Risk, Compliance, and Security functions to proactively identify governance gaps, drive remediation, and ensure PE operates within the organization's risk appetite.
- Maintain visibility and reporting on PE's compliance posture across all obligation types, surfacing trends, aging items, and residual risks to CARE leadership and relevant stakeholders.
Cloud Strategy & Architecture
- Define and execute the multi-year cloud architecture strategy aligned to business growth, scalability, regulatory compliance, and cost optimization goals.
- Establish cloud architectural standards, reference architectures, and governance frameworks (landing zones, identity, network patterns, service catalog) and drive adoption across engineering.
- Guide cloud-native architecture decisions including containers/orchestration, IaaS/PaaS adoption, disaster recovery, and multi-region patterns, with close attention to regulatory requirements and control frameworks (e.g., CIS, NIST).
- Oversee technology roadmaps and end-of-life planning for cloud platform components, ensuring forward-looking decisions balance innovation with operational stability.
- Serve as a key technical advisor to senior leadership, translating complex architectural trade-offs into clear business decisions.
Metrics & Reporting
- Own the platform metrics and reporting function, establishing a consistent framework for measuring platform health, engineering velocity, reliability, and cost efficiency across CARE.
- Define and track KPIs aligned to internal SLAs, executive reporting needs, and audit/compliance requirements.
- Ensure Jira and other platform tooling serve as the single source of truth for work visibility, with dashboards and reporting that enable data-driven prioritization.
- Build and maintain reporting cadences for leadership, including platform health scorecards, capacity forecasting, and risk transparency.
PM Coordination & Platform Delivery
- Serve as the primary engineering leadership partner to the Platform Engineering Program Management function, ensuring platform initiatives are properly scoped, sequenced, and resourced.
- Drive alignment between engineering capacity and roadmap commitments, proactively surfacing dependency risks and trade-off decisions to the CARE Executive Director.
- Coordinate across PE domains to ensure cross-team delivery dependencies are managed and resolved effectively.
- Partner with Product and Engineering leaders outside of CARE to align platform capabilities to broader organizational roadmaps.
Leadership & Organizational Excellence
- Lead, develop, and retain a high-performing team of engineering managers and individual contributors with clear ownership, career paths, and accountability frameworks.
- Foster a culture consistent with CARE operating principles: automation-first, full-stack ownership, stability as a prerequisite for velocity, and transparency through tooling.
- Manage budget for areas of responsibility; ensure adherence to schedules, work plans, and performance requirements.
- Oversee remediation of audit findings and observations within areas of responsibility, ensuring root cause is addressed, residual risk is reduced, and remediation is completed in a timely manner.
- Maintain appropriate work/life balance within teams while upholding a high standard of delivery quality.
Qualifications:
- [Required] Proven executive-level leadership of SRE, cloud engineering, or platform reliability organizations in a regulated industry environment.
- [Required] Demonstrated ability to build and scale SRE practices including SLO/SLA frameworks, on-call models, error budgets, and incident response programs.
- [Required] Deep expertise in cloud architecture strategy and governance, with experience defining and driving enterprise-wide architectural standards.
- [Required] Strong track record of cross-functional partnership with Program/Product Management, translating platform capabilities into sequenced, delivery-ready roadmaps.
- [Required] Demonstrated experience serving in a Product Manager capacity for technical domains such as FinOps, SecOps, or platform tooling, including ownership of roadmap, prioritization, and stakeholder alignment.
- [Required] Experience establishing and managing governance and compliance frameworks within a platform or infrastructure engineering organization, including oversight of incidents, problem management, risk items, and audit obligations.
- [Required] Ability to design and maintain metrics and reporting frameworks that provide meaningful visibility into platform health, engineering performance, and compliance posture.
- [Required] Exceptional written and verbal communication skills; ability to translate technical complexity into executive-level insights and business decisions.
- [Required] Demonstrated ability to lead high-performing, highly technical teams through accountability, coaching, and clear ownership models.
- [Required] Experience managing work in Agile/Scrum environments with strong prioritization and deadline management discipline.
- [Preferred] Experience operating in a production change control process and working directly with audit and compliance functions in a regulated environment.
- [Preferred] Experience in financial services or similarly regulated industries with exposure to CIS, NIST, and related frameworks.
Technical Skills:
- [Required] Deep knowledge of SRE tooling and observability platforms (e.g., Prometheus, Grafana, PagerDuty, Datadog, or equivalents).
- [Required] Expert-level knowledge of cloud platforms: AWS, Azure, or Google Cloud Platform; experience with multi-cloud or hybrid environments preferred.
- [Required] Strong working knowledge of cloud-native architecture patterns and Infrastructure as Code principles.
- [Required] Familiarity with container orchestration and streaming platforms (Kubernetes, Kafka) and CI/CD tooling (GitHub Actions, Jenkins, or equivalents).
- [Required] Experience with metrics and reporting platforms; ability to design KPI frameworks and reporting dashboards for both technical and executive audiences.
- [Required] Working knowledge of FinOps principles and cloud cost governance, with experience driving cost transparency and optimization at an organizational level.
- [Required] Familiarity with SecOps tooling and security governance practices within a cloud or platform engineering context.
- [Preferred] Experience with GRC tooling or platforms used to manage risk, audit findings, and compliance obligations (e.g., ServiceNow, Archer, or equivalents).
Education and/or Experience:
- [Required] Bachelor's degree, preferably in a technical discipline (Computer Science, Mathematics, Engineering, or related field), or equivalent combination of education and experience.
- [Required] 15+ years of progressive experience in cloud engineering, platform reliability, or infrastructure roles, including at least 5 years in senior engineering leadership.
- [Preferred] Depth and breadth of experience in a highly regulated industry such as financial services, with demonstrated understanding of applicable rules and regulatory frameworks.
- [Preferred] AWS Certified Solutions Architect - Associate certification or higher strongly desired.
- [Preferred] Google Cloud Professional Cloud Architect, Microsoft Azure Solutions Architect, or equivalent certification.
- [Preferred] Relevant SRE or reliability-focused certifications.