AI SRE with Blackdog Platform

New York, NY, US • Posted 1 day ago • Updated 2 hours ago

Contract W2

Contract Corp To Corp

On-site

Job Details

Summary

Job Title: AI Site Reliability Engineer (AI SRE), Blackdog Platform

Job Location: NYC, NY - 3 days onsite

Job Type: Contract

Job Duration: Long Term

Role Overview

We are seeking an experienced AI Site Reliability Engineer (AI SRE) to ensure the reliability, scalability, and performance of the Blackdog AI platform. This role sits at the intersection of machine learning, cloud infrastructure, and reliability engineering, owning production stability for AI/ML systems end-to-end.

You will design resilient systems, automate operations, and partner closely with ML engineers, data scientists, and product teams to run Blackdog safely and efficiently in production.

Key Responsibilities

Reliability & Operations

Own availability, latency, performance, and cost SLAs/SLOs for Blackdog AI services
Design and maintain highly available, fault-tolerant AI infrastructure
Lead incident response, root cause analysis, and postmortems for AI-related outages
Implement error budgets and reliability metrics for ML systems

AI/ML Production Engineering

Deploy, monitor, and scale ML models and inference services in production
Manage model lifecycle reliability (training → validation → deployment → rollback)
Detect and respond to model drift, data skew, and inference degradation
Partner with ML teams to productionize research models

Observability & Monitoring (Blackdog Requirements)

Build deep observability for:
- Model performance & accuracy
- Data quality and pipeline health
- Infrastructure and service metrics
Implement logging, tracing, and alerting tailored for AI workloads
Ensure Blackdog monitoring covers both system health and model behavior

Automation & Infrastructure

Automate deployments using CI/CD and MLOps pipelines
Manage infrastructure using Infrastructure as Code (Terraform, CloudFormation, etc.)
Optimize cloud resource usage for GPU/CPU workloads
Improve reliability through self-healing and auto-scaling systems

Security & Compliance

Ensure secure handling of data, models, and APIs
Support compliance requirements relevant to AI systems
Implement access controls and auditability for Blackdog services

Required Qualifications

Technical Skills

Strong experience in SRE, DevOps, or Platform Engineering roles
Hands-on experience supporting production AI/ML systems
Proficiency in Python (required); Go or Java a plus
Strong knowledge of Linux, containers (Docker), and Kubernetes
Experience with cloud platforms (AWS, Google Cloud Platform, or Azure)
Practical understanding of:
- Model serving and inference pipelines
- Data pipelines and feature stores
- GPU-based workloads

Blackdog-Specific Requirements

Experience operating mission-critical AI platforms similar to Blackdog
Ability to define reliability standards for AI-driven products
Strong troubleshooting skills across data, model, and infrastructure layers

Monitoring & Tooling

Experience with observability tools (Prometheus, Grafana, Datadog, OpenTelemetry, etc.)
Familiarity with ML monitoring tooling (e.g., for tracking model metrics and detecting drift)

Nice to Have

Experience with MLOps frameworks (Kubeflow, MLflow, SageMaker, Vertex AI)
Background in machine learning or data science
Experience supporting real-time or high-throughput inference systems
Exposure to AI governance, explainability, or responsible AI

Employers have access to artificial intelligence language tools (“AI”) that help generate and enhance job descriptions and AI may have been used to create this description. The position description has been reviewed for accuracy and Dice believes it to correctly reflect the job opportunity.

Dice Id: 10113809
Position Id: 2026-103618