AI SRE with Blackdog Platform

New York, NY, US • Posted 27 days ago • Updated 6 days ago
Contract Corp To Corp
Contract W2
On-site
Fitment

Job Details

Skills

  • sre

Summary

Job Title: AI Site Reliability Engineer (AI SRE), Blackdog Platform

Job Location: NYC, NY - 3 days onsite

Job Type: Contract

Job Duration: Long Term.

Role Overview

We are seeking an experienced AI Site Reliability Engineer (AI SRE) to ensure the reliability, scalability, and performance of the Blackdog AI platform. This role sits at the intersection of machine learning, cloud infrastructure, and reliability engineering, owning production stability for AI/ML systems end-to-end.

You will design resilient systems, automate operations, and partner closely with ML engineers, data scientists, and product teams to run Blackdog safely and efficiently in production.

Key Responsibilities

Reliability & Operations
  • Own availability, latency, performance, and cost SLAs/SLOs for Blackdog AI services
  • Design and maintain highly available, fault-tolerant AI infrastructure
  • Lead incident response, root cause analysis, and postmortems for AI-related outages
  • Implement error budgets and reliability metrics for ML systems
AI/ML Production Engineering
  • Deploy, monitor, and scale ML models and inference services in production
  • Manage model lifecycle reliability (training, validation, deployment, and rollback)
  • Detect and respond to model drift, data skew, and inference degradation
  • Partner with ML teams to productionize research models
Observability & Monitoring (Blackdog Requirements)
  • Build deep observability for:
    • Model performance & accuracy
    • Data quality and pipeline health
    • Infrastructure and service metrics
  • Implement logging, tracing, and alerting tailored for AI workloads
  • Ensure Blackdog monitoring covers both system health and model behavior
Automation & Infrastructure
  • Automate deployments using CI/CD and MLOps pipelines
  • Manage infrastructure using Infrastructure as Code (Terraform, CloudFormation, etc.)
  • Optimize cloud resource usage for GPU/CPU workloads
  • Improve reliability through self-healing and auto-scaling systems
Security & Compliance
  • Ensure secure handling of data, models, and APIs
  • Support compliance requirements relevant to AI systems
  • Implement access controls and auditability for Blackdog services
Required Qualifications

Technical Skills
  • Strong experience in SRE, DevOps, or Platform Engineering roles
  • Hands-on experience supporting production AI/ML systems
  • Proficiency in Python (required); Go or Java a plus
  • Strong knowledge of Linux, containers (Docker), and Kubernetes
  • Experience with cloud platforms (AWS, Google Cloud Platform, or Azure)
  • Practical understanding of:
    • Model serving and inference pipelines
    • Data pipelines and feature stores
    • GPU-based workloads
Blackdog-Specific Requirements
  • Experience operating mission-critical AI platforms similar to Blackdog
  • Ability to define reliability standards for AI-driven products
  • Strong troubleshooting skills across data, model, and infrastructure layers
Monitoring & Tooling
  • Experience with observability tools (Prometheus, Grafana, Datadog, OpenTelemetry, etc.)
  • Familiarity with ML monitoring tooling (e.g., for model metrics and drift detection)
Nice to Have
  • Experience with MLOps frameworks (Kubeflow, MLflow, SageMaker, Vertex AI)
  • Background in machine learning or data science
  • Experience supporting real-time or high-throughput inference systems
  • Exposure to AI governance, explainability, or responsible AI
Employers have access to artificial intelligence language tools (“AI”) that help generate and enhance job descriptions and AI may have been used to create this description. The position description has been reviewed for accuracy and Dice believes it to correctly reflect the job opportunity.
  • Dice Id: 10113809
  • Position Id: 2026-103618