Principal Site Reliability Engineer - AI

Overview

On Site
200k - 250k
Full Time

Skills

Real-time
Recruiting
Scalability
Product Development
Leadership
IaaS
Incident Management
Root Cause Analysis
Data Engineering
Continuous Integration
Continuous Delivery
Roadmaps
Mentorship
Capacity Management
Performance Testing
Hardening
Collaboration
Regulatory Compliance
HIPAA
SOC 2
Privacy
Operational Excellence
DevOps
Kubernetes
Orchestration
Microservices
Amazon Web Services
Google Cloud
Google Cloud Platform
Microsoft Azure
Terraform
Scripting
Python
Bash
Reliability Engineering
Observability
Grafana
Health Care
SaaS
Machine Learning (ML)
Lifecycle Management
Communication
Artificial Intelligence
Cloud Computing

Job Details

About Our Client
Our client is an AI-driven health-tech start-up on a mission to transform patient care through intelligent, secure, and highly reliable clinical automation tools. Their platform powers real-time insights for clinicians, improving patient outcomes and enabling healthcare systems to operate with unprecedented efficiency. They are entering a high-growth phase and are seeking a Principal Site Reliability Engineer to help scale their infrastructure and ensure world-class reliability.
Role Overview
Our client is hiring a Principal Site Reliability Engineer to serve as the technical authority for the reliability, scalability, and performance of their cloud-native infrastructure. This individual will design and implement systems that support rapid product development while meeting the resilience requirements of clinical-grade AI applications. The role blends hands-on engineering with architectural leadership and cross-functional collaboration across product, ML, infrastructure, and security teams.
What You'll Do
  • Architect, build, and optimize scalable, secure, and highly available cloud infrastructure (AWS/Google Cloud Platform/Azure).
  • Lead incident response, root-cause analysis, and production reliability improvements across the platform.
  • Implement observability frameworks (metrics, tracing, logging) that provide deep visibility into system performance.
  • Partner with ML and data engineering teams to operationalize AI/ML pipelines, ensuring reliability from data ingestion through model deployment.
  • Develop automated CI/CD pipelines, infrastructure-as-code, and guardrails for safer, faster deployments.
  • Define SLOs/SLIs and establish long-term reliability roadmaps aligned with clinical-grade requirements.
  • Mentor SREs and software engineers, promoting DevOps and reliability best practices across engineering.
  • Lead capacity planning, performance testing, and system hardening initiatives.
  • Collaborate with security teams to ensure compliance with HIPAA, SOC 2, and relevant privacy and security standards.
  • Evaluate new technologies and drive adoption of tools that improve operational excellence.
What They're Looking For
  • 8+ years in SRE, DevOps, Infrastructure Engineering, or related fields.
  • Deep expertise with Kubernetes, container orchestration, and microservices architecture.
  • Strong experience with cloud platforms (AWS/Google Cloud Platform/Azure) and infrastructure-as-code tools such as Terraform, Pulumi, or CloudFormation.
  • Advanced proficiency in automation/scripting languages such as Python, Go, or Bash.
  • Strong knowledge of distributed systems, reliability engineering patterns, and modern observability stacks (Prometheus, Grafana, OpenTelemetry, Datadog, etc.).
  • Experience supporting highly regulated or mission-critical environments (healthcare, fintech, SaaS).
  • Hands-on experience with ML infrastructure, model lifecycle management, or data pipelines is a plus.
  • Excellent communication skills and a proactive, ownership-oriented mindset.
Why Candidates Love This Role
  • Mission-driven work that directly influences patient care and health outcomes.
  • Ownership of foundational infrastructure in a rapidly scaling AI start-up.
  • Competitive compensation, equity, and benefits.
  • A modern, cloud-native tech stack with the ability to shape future architecture.
  • A collaborative and innovative engineering culture.

Employers have access to artificial intelligence language tools (“AI”) that help generate and enhance job descriptions and AI may have been used to create this description. The position description has been reviewed for accuracy and Dice believes it to correctly reflect the job opportunity.

About Motion Recruitment Partners, LLC