Principal Software Engineer - AI Platform (Production Engineering / Reliability)

Remote • Posted 3 hours ago • Updated 3 hours ago

Full Time

Remote

USD $144,200.00 - $288,400.00 per year

Job Details

Skills

Accountability
Health Care
Integrated Circuit
Internal Communications
IC
Leadership
Budget
Root Cause Analysis
Performance Monitoring
Extract
Transform
Load
Dashboard
Real-time
Training
CHAOS
Testing
IT Management
Mentorship
Reliability Engineering
Software Engineering
Production Engineering
Production Support
Cloud Computing
Microsoft Azure
Amazon Web Services
Google Cloud Platform
Google Cloud
Artificial Intelligence
Machine Learning (ML)
Lifecycle Management
Machine Learning Operations (ML Ops)
Grafana
Kubernetes
Streaming
Apache Kafka
High Availability
Scalability
Debugging
Incident Management
Operational Excellence
Finance

Summary

We're building a world of health around every individual - shaping a more connected, convenient and compassionate health experience. At CVS Health, you'll be surrounded by passionate colleagues who care deeply, innovate with purpose, hold themselves accountable and prioritize safety and quality in everything they do. Join us and be part of something bigger - helping to simplify health care one person, one family and one community at a time.

Overview
We are seeking a Principal Individual Contributor (IC) to lead production engineering, observability, and operational excellence for our AI Platform. This role sits at the intersection of ML systems, distributed infrastructure, and production reliability, ensuring that our AI services are scalable, observable, and resilient in real-world environments.
As a senior technical leader, you will define and drive best-in-class production practices, build robust monitoring and alerting ecosystems, and partner across engineering, ML, and platform teams to ensure mission-critical AI systems meet high availability, performance, and reliability standards.

Key Responsibilities
Production Reliability & Operations Leadership

Own and evolve production operations strategy for AI/ML platforms and services
Define SLOs, SLIs, and error budgets for AI systems (online & batch/inference pipelines)
Lead root cause analysis (RCA) and drive systemic improvements post-incident
Establish operational readiness standards for launching new AI capabilities
Build frameworks for on-call excellence, incident response, and escalation

Observability, Monitoring & Alerting

Design and implement end-to-end observability systems across AI workloads:
- Model performance monitoring
- Data pipeline health
- Infrastructure metrics
Build and scale monitoring and alerting frameworks using modern tooling (e.g., Prometheus, Grafana, OpenTelemetry, Datadog, Azure Monitor)
Define actionable, low-noise alerts tied to business and system impact
Develop dashboards and telemetry standards for real-time visibility across services
Drive adoption of golden signals (latency, errors, throughput, saturation) in AI systems

AI/ML Production Systems Excellence

Ensure reliable deployment and operation of:
- Real-time inference services
- Model pipelines (training, validation, deployment)
- Data ingestion and feature pipelines
Implement model observability (drift detection, data skew, performance degradation)
Partner with ML engineers to improve production readiness of models
Establish lifecycle standards for models in production environments

Automation & Platform Development

Build internal platforms and tooling for:
- Automated incident detection and response
- Self-healing systems
- Deployment validation and canarying
Drive Infrastructure as Code (IaC) and policy automation
Improve system resilience through chaos testing and fault injection

Technical Leadership & Strategy

Act as a trusted technical advisor across platform, ML, and product teams
Set direction for operational excellence in AI systems at org scale
Mentor senior engineers and influence cross-team architectural decisions
Lead adoption of industry best practices in reliability engineering and observability

Required Qualifications

10+ years in software engineering, production engineering, or SRE roles
Deep experience operating large-scale distributed systems in production
Proven track record building monitoring, observability, and alerting systems
Strong expertise in incident management and production support models
Experience working with cloud platforms (Azure, AWS, Google Cloud Platform)

Preferred Qualifications

Experience supporting AI/ML platforms or data-intensive systems
Familiarity with model lifecycle management and MLOps practices
Knowledge of:
- OpenTelemetry, Prometheus, Grafana, Datadog
- Kubernetes and containerized workloads
- Streaming systems (e.g., Kafka, Azure Event Hubs)
Experience defining and implementing SLO-driven engineering
Background in high-availability, low-latency systems

Key Competencies

Systems thinking and ability to reason about complex, interdependent systems
Strong bias for automation, scalability, and long-term solutions
Exceptional debugging and incident management skills
Ability to influence without authority across multiple teams
Passion for operational excellence and reliability

Pay Range

The typical pay range for this role is:

$144,200.00 - $288,400.00

This pay range represents the base hourly rate or base annual full-time salary for all positions in the job grade within which this position falls. The actual base salary offer will depend on a variety of factors including experience, education, geography and other relevant factors. This position is eligible for a CVS Health bonus, commission or short-term incentive program in addition to the base pay range listed above. This position also includes an award target in the company's equity award program.

Our people fuel our future. Our teams reflect the customers, patients, members and communities we serve and we are committed to fostering a workplace where every colleague feels valued and that they belong.

Great benefits for great people

We take pride in offering a comprehensive and competitive mix of pay and benefits that reflects our commitment to our colleagues and their families.

This full-time position is eligible for a comprehensive benefits package designed to support the physical, emotional, and financial well-being of colleagues and their families. The benefits for this position include medical, dental, and vision coverage, paid time off, retirement savings options, wellness programs, and other resources, based on eligibility.

Additional details about available benefits are provided during the application process and on Benefits Moments.

We anticipate the application window for this opening will close on: 06/04/2026

Qualified applicants with arrest or conviction records will be considered for employment in accordance with all federal, state and local laws.

Employers have access to artificial intelligence language tools (“AI”) that help generate and enhance job descriptions and AI may have been used to create this description. The position description has been reviewed for accuracy and Dice believes it to correctly reflect the job opportunity.

Dice Id: 80180635
Position Id: dff925cf5dbe066afd403494112d6f48