Senior AI Site Reliability Engineer (AI SRE)
Hybrid in New York, NY, US • Posted 8 hours ago • Updated 8 hours ago

Application Management Services LLC
Job Details
Skills
- Amazon SageMaker
- Artificial Intelligence
- Continuous Integration
- Data Quality
- Data Modeling
- Cloud Computing
- Amazon Web Services
- Google Cloud Platform
- Machine Learning (ML)
- Machine Learning Operations (ML Ops)
- Regulatory Compliance
- Vertex
- Python
- Terraform
- Docker
- DevOps
- Incident Management
- Linux
- Java
- Grafana
- Root Cause Analysis
Summary
Blackdog is hiring a Senior AI Site Reliability Engineer to lead the reliability, scalability, and performance of our production AI/ML platform. This role is deeply technical and hands‑on, owning end‑to‑end stability for mission‑critical model serving, data pipelines, and GPU‑intensive workloads. You will architect resilient systems, drive automation, and set reliability standards for Blackdog’s AI products.
Responsibilities
- Own SLOs/SLAs for availability, latency, performance, and cost across AI services
- Architect and operate highly available, fault‑tolerant AI/ML infrastructure
- Lead incident response, deep‑dive troubleshooting, RCA, and postmortems
- Deploy, monitor, and scale ML models and real‑time inference services
- Manage model lifecycle (training → validation → deployment → rollback)
- Detect and mitigate model drift, data skew, and inference degradation
- Build observability for model accuracy, data quality, pipelines, and system health
- Implement logging, tracing, and alerting for AI workloads
- Automate CI/CD and MLOps pipelines; manage IaC (Terraform, CloudFormation)
- Optimize cloud compute (GPU/CPU) for performance and cost efficiency
- Ensure secure handling of data, models, APIs, and compliance requirements
Must‑Have Skills
- 7+ years in SRE, DevOps, or Platform Engineering
- Proven experience running production AI/ML systems at scale
- Strong Python; Go/Java a plus
- Deep expertise with Linux, Docker, Kubernetes
- Cloud experience with AWS, Google Cloud Platform, or Azure
- Strong understanding of model serving, inference pipelines, data pipelines, feature stores
- Experience with GPU workloads and performance tuning
- Advanced troubleshooting across data, model, and infrastructure layers
- Observability tools: Prometheus, Grafana, Datadog, OpenTelemetry
- ML monitoring (model metrics, drift detection, inference health)
- CI/CD, MLOps, IaC (Terraform, CloudFormation)
Nice to Have
- Experience with Kubeflow, MLflow, SageMaker, Vertex AI
- Background in ML or data science
- Experience with real‑time, high‑throughput inference systems
- Exposure to AI governance, explainability, or responsible AI
Success Indicators
- AI services consistently exceed reliability and performance targets
- Incidents decrease through strong operational rigor and automation
- Models are deployed safely, quickly, and with confidence
- Engineering teams rely on the platform and tooling you build
- Dice Id: 91165607
- Position Id: 8875018

