Senior Site Reliability Engineer-AI (AI SRE)

Hybrid in New York, NY, US • Posted 29 days ago • Updated 27 days ago

Contract Independent

Contract W2

No Travel Required

$60 - $70/hr

Application Management Services LLC

Job Details

Skills

Amazon SageMaker
Artificial Intelligence
Continuous Integration
Data Quality
Data Modeling
Cloud Computing
Amazon Web Services
Google Cloud Platform
Machine Learning (ML)
Machine Learning Operations (ML Ops)
Regulatory Compliance
Vertex AI
Python
Terraform
Docker
DevOps
Incident Management
Linux
Java
Grafana
Root Cause Analysis

Summary

Senior AI Site Reliability Engineer (AI SRE)

Blackdog is hiring a Senior AI Site Reliability Engineer to lead the reliability, scalability, and performance of our production AI/ML platform. This role is deeply technical and hands‑on, owning end‑to‑end stability for mission‑critical model serving, data pipelines, and GPU‑intensive workloads. You will architect resilient systems, drive automation, and set reliability standards for Blackdog’s AI products.

Responsibilities

Own SLOs/SLAs for availability, latency, performance, and cost across AI services
Architect and operate highly available, fault‑tolerant AI/ML infrastructure
Lead incident response, deep‑dive troubleshooting, RCA, and postmortems
Deploy, monitor, and scale ML models and real‑time inference services
Manage model lifecycle (training → validation → deployment → rollback)
Detect and mitigate model drift, data skew, and inference degradation
Build observability for model accuracy, data quality, pipelines, and system health
Implement logging, tracing, and alerting for AI workloads
Automate CI/CD and MLOps pipelines; manage IaC (Terraform, CloudFormation)
Optimize cloud compute (GPU/CPU) for performance and cost efficiency
Ensure secure handling of data, models, APIs, and compliance requirements

Must‑Have Skills

7+ years in SRE, DevOps, or Platform Engineering
Proven experience running production AI/ML systems at scale
Strong Python; Go/Java a plus
Deep expertise with Linux, Docker, Kubernetes
Cloud experience with AWS, Google Cloud Platform, or Azure
Strong understanding of model serving, inference pipelines, data pipelines, feature stores
Experience with GPU workloads and performance tuning
Advanced troubleshooting across data, model, and infrastructure layers
Observability tools: Prometheus, Grafana, Datadog, OpenTelemetry
ML monitoring (model metrics, drift detection, inference health)
CI/CD, MLOps, IaC (Terraform, CloudFormation)

Nice to Have

Experience with Kubeflow, MLflow, SageMaker, Vertex AI
Background in ML or data science
Experience with real‑time, high‑throughput inference systems
Exposure to AI governance, explainability, or responsible AI

Success Indicators

AI services consistently exceed reliability and performance targets
Incidents decrease through strong operational rigor and automation
Models are deployed safely, quickly, and with confidence
Engineering teams rely on the platform and tooling you build

Dice Id: 91165607
Position Id: 8875018