Job Title: AI Site Reliability Engineer (AI SRE) - Blackdog Platform
Job Location: NYC, NY - 3 days onsite
Job Type: Contract
Job Duration: Long Term
Role Overview
We are seeking an experienced AI Site Reliability Engineer (AI SRE) to ensure the reliability, scalability, and performance of the Blackdog AI platform. This role sits at the intersection of machine learning, cloud infrastructure, and reliability engineering, owning production stability for AI/ML systems end-to-end.
You will design resilient systems, automate operations, and partner closely with ML engineers, data scientists, and product teams to run Blackdog safely and efficiently in production.
Key Responsibilities
Reliability & Operations
- Own availability, latency, performance, and cost SLAs/SLOs for Blackdog AI services
- Design and maintain highly available, fault-tolerant AI infrastructure
- Lead incident response, root cause analysis, and postmortems for AI-related outages
- Implement error budgets and reliability metrics for ML systems
AI/ML Production Engineering
- Deploy, monitor, and scale ML models and inference services in production
- Manage model lifecycle reliability (training, validation, deployment, rollback)
- Detect and respond to model drift, data skew, and inference degradation
- Partner with ML teams to productionize research models
Observability & Monitoring (Blackdog Requirements)
- Build deep observability for:
  - Model performance & accuracy
  - Data quality and pipeline health
  - Infrastructure and service metrics
- Implement logging, tracing, and alerting tailored for AI workloads
- Ensure Blackdog monitoring covers both system health and model behavior
Automation & Infrastructure
- Automate deployments using CI/CD and MLOps pipelines
- Manage infrastructure using Infrastructure as Code (Terraform, CloudFormation, etc.)
- Optimize cloud resource usage for GPU/CPU workloads
- Improve reliability through self-healing and auto-scaling systems
Security & Compliance
- Ensure secure handling of data, models, and APIs
- Support compliance requirements relevant to AI systems
- Implement access controls and auditability for Blackdog services
Required Qualifications
Technical Skills
- Strong experience in SRE, DevOps, or Platform Engineering roles
- Hands-on experience supporting production AI/ML systems
- Proficiency in Python (required); Go or Java a plus
- Strong knowledge of Linux, containers (Docker), and Kubernetes
- Experience with cloud platforms (AWS, Google Cloud Platform, or Azure)
- Practical understanding of:
  - Model serving and inference pipelines
  - Data pipelines and feature stores
  - GPU-based workloads
Blackdog-Specific Requirements
- Experience operating mission-critical AI platforms similar to Blackdog
- Ability to define reliability standards for AI-driven products
- Strong troubleshooting skills across data, model, and infrastructure layers
Monitoring & Tooling
- Experience with observability tools (Prometheus, Grafana, Datadog, OpenTelemetry, etc.)
- Familiarity with ML monitoring tooling (e.g., for tracking model metrics and detecting drift)
Nice to Have
- Experience with MLOps frameworks (Kubeflow, MLflow, SageMaker, Vertex AI)
- Background in machine learning or data science
- Experience supporting real-time or high-throughput inference systems
- Exposure to AI governance, explainability, or responsible AI