Director, SRE (PL)

Overview

On Site

USD 136,900.00 - 270,000.00 per year

Full Time

Skills

Profit And Loss

Creative Problem Solving

Finance

Performance Engineering

Apache Velocity

Mentorship

Continuous Improvement

Systems Design

Operational Efficiency

Root Cause Analysis

Capacity Management

Performance Tuning

Cost Management

Version Control

Code Review

Collaboration

Leadership

Management

Software Engineering

Programming Languages

Python

Java

Cloud Computing

Amazon Web Services

Microsoft Azure

Google Cloud

Google Cloud Platform

Systems Architecture

Continuous Integration

Continuous Delivery

Configuration Management

Budget

Incident Management

Regulatory Compliance

Risk Management

Communication

Articulate

Reliability Engineering

Return On Investment

Investments

Artificial Intelligence

Machine Learning (ML)

CHAOS

Testing

Kubernetes

Terraform

Grafana

IT Operations

Job Details

Your Opportunity

We believe in the importance of in-office collaboration and fully intend for the selected candidate for this role to work on site in the specified location(s).

At Schwab, you're empowered to make an impact on your career. Here, innovative thought meets creative problem solving, helping us "challenge the status quo" and transform the finance industry together.

We are seeking an experienced SRE Director to lead and scale our Site Reliability Engineering organization. This role requires a proven technology leader who can drive the adoption of advanced tools and methodologies, foster a culture of continuous improvement, and ensure our systems are resilient, secure, and scalable. You will be instrumental in guiding teams through complex AIOps transformations while empowering them to embrace new technologies and build a high-performance engineering culture.

This is not a traditional operations role. We're looking for a leader who embraces the SRE philosophy: treating operations as a software engineering problem, eliminating toil through automation, and using data-driven approaches to balance reliability with velocity. You'll lead the transformation from reactive operations to proactive engineering, where reliability is designed in, not bolted on.

Key Responsibilities

Lead, mentor, and scale a high-performing team of SRE engineers and managers.
Define and execute the strategic vision for site reliability, availability, and performance across the organization.
Drive the adoption of advanced SRE practices, automation frameworks, and AI-powered operational tools.
Foster a culture of continuous improvement and blameless learning through postmortems, turning failures into opportunities for growth.
Partner with Engineering, Product, and Security teams to align SRE initiatives with business objectives.
Transform the traditional operations mindset into an SRE culture, shifting from reactive firefighting to proactive system design and from manual processes to software-driven automation.
Ensure systems are resilient, secure, and scalable to meet current and future business demands.
Lead transformation initiatives leveraging AIOps and intelligent automation to enhance operational efficiency.
Establish and maintain SLIs, SLOs, and error budgets to drive reliability commitments and enable data-driven discussions about acceptable risk.
Lead automation initiatives to eliminate toil and scale operational efficiency, prioritizing code-driven solutions over manual processes.
Drive incident management excellence, including root cause analysis, postmortem culture, and continuous learning.
Oversee capacity planning, performance optimization, and infrastructure cost management.
Apply software engineering principles to operations: version control, code review, testing, and CI/CD for all infrastructure and tooling.
Foster collaboration between development and operations teams through SRE principles, breaking down silos and embedding reliability into the development process.

What you have

Required Qualifications

10+ years of experience in software engineering, infrastructure, or site reliability roles.
5+ years of people leadership experience managing engineering teams and managers.
Strong software engineering background with proficiency in programming languages (Python, Go, Java, etc.); this is not an operations-only role.
Deep expertise in cloud platforms (AWS, Azure, Google Cloud Platform) and distributed systems architecture.
Strong background in automation, CI/CD, infrastructure as code, and configuration management.
Proven track record of driving large-scale technical and operational transformations, including AIOps adoption.
Experience implementing SLO/SLI frameworks and error budget policies.
Experience with observability tools, monitoring platforms, and incident management systems.
Strong understanding of security best practices, compliance requirements, and risk management.
Excellent communication skills with the ability to influence stakeholders at all levels.
Ability to articulate the business value of reliability engineering and the ROI of automation investments.

Preferred Qualifications

Experience with AI/ML operations, AIOps platforms, and intelligent automation.
Background in chaos engineering, game days, and resilience testing.
Knowledge of modern SRE tools and practices (Kubernetes, Terraform, Datadog, Grafana, etc.).
Experience leading the cultural transformation from traditional IT operations to SRE.

Employers have access to artificial intelligence language tools (“AI”) that help generate and enhance job descriptions and AI may have been used to create this description. The position description has been reviewed for accuracy and Dice believes it to correctly reflect the job opportunity.
