Sr. AI Site Reliability Engineer, AI.x

Austin, TX, US • Posted 30+ days ago • Updated 5 hours ago

Full Time

On-site

Job Details

Skills

Creative Problem Solving
Finance
Product Engineering
Generative Artificial Intelligence (AI)
Customer Experience
Business Strategy
Roadmaps
Automated Testing
Apache Velocity
Real-time
Root Cause Analysis
Management
Service Level
Budget
Provisioning
Configuration Management
Operational Efficiency
Collaboration
Scalability
Capacity Management
Optimization
Artificial Intelligence
Software Engineering
Startups
Scratch
Continuous Integration
Continuous Delivery
High Availability
Cloud Computing
Computer Science
Open Source
Communication
Incident Management
Reliability Engineering
Terraform
Google Cloud Platform
Google Cloud

Summary

Your Opportunity

At Schwab, you will build a rewarding career while making a difference in the lives of our millions of clients. Here, innovative thinking meets creative problem solving as we work together to challenge the status quo. Joining Schwab means joining a company committed to transforming the financial industry and putting clients at the center of everything we do.

Schwab's AI Strategy & Transformation team, known as AI.x, is the central hub for Artificial Intelligence at Schwab. We are an integrated product, engineering, strategy and risk team, all based in San Francisco. We help set the enterprise vision for AI, invest in the most promising opportunities, and accelerate delivery across the company. We also build the core platform that powers AI at scale and explore next-generation GenAI efforts that will redefine how we serve our clients. As a Senior Engineer on AI.x, you will play a key role in bringing these priorities to life by designing and delivering innovative AI solutions.

This role is an opportunity to join a high-profile team shaping Schwab's future with AI, to build solutions that matter to millions of clients, and to grow your career in one of the most exciting areas of technology today.

As a Senior AI Site Reliability Engineer, you will support reliability efforts for cutting-edge GenAI applications that enhance the client experience and create value. You will work closely with architects and engineers to ensure the scalability, reliability, and security of solutions that build towards an enterprise strategy. You will lead automation-first initiatives, build robust CI/CD pipelines for one-touch deployments, and implement comprehensive observability frameworks to minimize MTTD and MTTR. This role requires participation in on-call rotations to ensure 24/7 reliability of critical AI systems. Above all, you will apply rigor, discipline, and technical depth to help shape the next generation of AI at Schwab.

Roles & Responsibilities:

Lead automation-first initiatives to eliminate toil and manual interventions, defining and executing the strategic roadmap for reliability, observability, and self-healing systems across AI.x platforms
Design and implement robust CI/CD pipelines enabling one-touch deployments with automated testing, validation, and rollback capabilities to accelerate delivery velocity and reduce deployment risk
Implement comprehensive observability frameworks for real-time monitoring of AI services, including metrics, logs, and traces, with intelligent alerting and automated diagnostics to minimize MTTD and MTTR
Participate in the on-call rotation, providing 24/7 support for production AI systems and ensuring rapid incident response, root cause analysis, and resolution against measurable SLO targets
Establish and manage Service Level Objectives (SLOs), Service Level Indicators (SLIs), error budgets, and incident response runbooks to drive continuous reliability improvements
Champion Infrastructure-as-Code (IaC) practices and automate environment provisioning, configuration management, and deployment processes to ensure consistency, repeatability, and operational efficiency
Collaborate seamlessly with AI Engineering teams to integrate SRE practices early in the development lifecycle, promoting a culture of reliability and shared responsibility
Proactively identify and resolve reliability, performance, and scalability issues through data-driven analysis, capacity planning, and system optimization
Implement and maintain monitoring, alerting, and incident response frameworks to ensure system health and reliability, maximizing production availability
Champion reliability, monitoring, observability, and operational best practices for AI systems and data pipelines, establishing patterns and standards for the organization

What you have

Required Qualifications

8+ years of software engineering experience, with 4+ years as a hands-on Site Reliability Engineer in startups and/or large organizations.
Bachelor's degree in Computer Science or related field, or equivalent experience.
5+ years building complex products from scratch, running them in production, and ensuring operational reliability.
3+ years working with containers and cloud-native applications, operationalizing them in the public cloud with infrastructure as code and CI/CD pipelines.
3+ years of experience working in high-availability hybrid-cloud environments.

Preferred Qualifications

Strong computer science fundamentals and experience across the tech stack.
Experience with proprietary or open-source LLMs (e.g., Gemini, Claude, OpenAI models), including deploying LLM-powered applications to production and maintaining their availability.
Strong written and verbal communication skills to clearly convey ideas and feedback.
Strong understanding of observability, incident management and reliability engineering principles.
Mindset of continuous learning and improvement, adept at both giving and receiving feedback.
Ability to troubleshoot complex problems with ambiguous or incomplete data in distributed systems.
Curiosity about new technologies and processes, proactively sharing knowledge and seeking improvement.
Experience with Terraform and Google Cloud Platform.

In addition to the salary range, this role is eligible for bonus or incentive opportunities.

Employers have access to artificial intelligence language tools (“AI”) that help generate and enhance job descriptions and AI may have been used to create this description. The position description has been reviewed for accuracy and Dice believes it to correctly reflect the job opportunity.

Dice Id: 90989465
Position Id: 6ecfca1b26cc2b9c9cbdd7db5b95edea