Senior Site Reliability Engineer-AI (AI SRE)

Hybrid in New York, NY, US • Posted 29 days ago • Updated 27 days ago
Contract Independent
Contract W2
No Travel Required
$60 - $70/hr

Job Details

Skills

  • Amazon SageMaker
  • Artificial Intelligence
  • Continuous Integration
  • Data Quality
  • Data Modeling
  • Cloud Computing
  • Amazon Web Services
  • Google Cloud Platform
  • Machine Learning (ML)
  • Machine Learning Operations (ML Ops)
  • Regulatory Compliance
  • Vertex
  • Python
  • Terraform
  • Docker
  • DevOps
  • Incident Management
  • Linux
  • Java
  • Grafana
  • Root Cause Analysis

Summary

Senior AI Site Reliability Engineer (AI SRE)

Blackdog is hiring a Senior AI Site Reliability Engineer to lead the reliability, scalability, and performance of our production AI/ML platform. This role is deeply technical and hands‑on, owning end‑to‑end stability for mission‑critical model serving, data pipelines, and GPU‑intensive workloads. You will architect resilient systems, drive automation, and set reliability standards for Blackdog’s AI products.

Responsibilities

  • Own SLOs/SLAs for availability, latency, performance, and cost across AI services

  • Architect and operate highly available, fault‑tolerant AI/ML infrastructure

  • Lead incident response, deep‑dive troubleshooting, RCA, and postmortems

  • Deploy, monitor, and scale ML models and real‑time inference services

  • Manage model lifecycle (training → validation → deployment → rollback)

  • Detect and mitigate model drift, data skew, and inference degradation

  • Build observability for model accuracy, data quality, pipelines, and system health

  • Implement logging, tracing, and alerting for AI workloads

  • Automate CI/CD and MLOps pipelines; manage IaC (Terraform, CloudFormation)

  • Optimize cloud compute (GPU/CPU) for performance and cost efficiency

  • Ensure secure handling of data, models, APIs, and compliance requirements

Must‑Have Skills

  • 7+ years in SRE, DevOps, or Platform Engineering

  • Proven experience running production AI/ML systems at scale

  • Strong Python; Go/Java a plus

  • Deep expertise with Linux, Docker, Kubernetes

  • Cloud experience with AWS, Google Cloud Platform, or Azure

  • Strong understanding of model serving, inference pipelines, data pipelines, and feature stores

  • Experience with GPU workloads and performance tuning

  • Advanced troubleshooting across data, model, and infrastructure layers

  • Observability tools: Prometheus, Grafana, Datadog, OpenTelemetry

  • ML monitoring (model metrics, drift detection, inference health)

  • CI/CD, MLOps, IaC (Terraform, CloudFormation)

Nice to Have

  • Experience with Kubeflow, MLflow, SageMaker, or Vertex AI

  • Background in ML or data science

  • Experience with real‑time, high‑throughput inference systems

  • Exposure to AI governance, explainability, or responsible AI

Success Indicators

  • AI services consistently exceed reliability and performance targets

  • Incidents decrease through strong operational rigor and automation

  • Models are deployed safely, quickly, and with confidence

  • Engineering teams rely on the platform and tooling you build

  • Dice Id: 91165607
  • Position Id: 8875018

Company Info

About Application Management Services LLC
