ML Ops Lead

  • Posted 3 hours ago | Updated 3 hours ago

Overview

Remote
Depends on Experience
Contract - W2
Contract - Independent
Contract - 12 Month(s)
No Travel Required
Unable to Provide Sponsorship

Skills

Amazon SageMaker
Amazon Web Services
Artificial Intelligence
Data Science
Machine Learning (ML)
Gen AI
RAG
Langchain
Machine Learning Operations (ML Ops)
Computer Science
Team Leadership

Job Details

Job Summary

The ML Ops Lead is responsible for architecting, implementing, and operating scalable and reliable machine learning infrastructure and workflows that take AI/ML models from experimentation into robust production environments. This role balances hands-on engineering excellence with team leadership and strategic ownership of machine learning operations practices across the organization.


Key Responsibilities

1. ML Infrastructure & Operations

  • Architect and maintain scalable ML infrastructure, including compute, storage, orchestration, and monitoring, in cloud (AWS, Azure, Google Cloud Platform) and/or hybrid environments.
  • Build and manage end-to-end ML pipelines for data ingestion, model training, validation, deployment, monitoring, and retraining.
  • Containerize and orchestrate workloads using Docker and Kubernetes (EKS/AKS/GKE), and provision infrastructure with Terraform or similar IaC tools.

2. CI/CD & Automation

  • Design and operate CI/CD pipelines for ML workflows (model retraining, version control, deployment, rollback).
  • Automate testing, validation, and release processes for production ML systems.

3. Production Reliability & Monitoring

  • Establish monitoring, logging, observability, drift detection, and alerting for deployed models.
  • Troubleshoot operational issues, optimize performance, and ensure high availability and scalability.

4. Leadership & Strategic Ownership

  • Lead, mentor, and grow a team of ML Ops engineers and platform specialists.
  • Drive ML Ops strategy and roadmap, aligned with business goals and regulatory standards.
  • Collaborate closely with data scientists, software engineers, product owners, and DevOps teams to deliver production-ready models.

5. Governance & Best Practices

  • Implement governance, security, auditability, and compliance practices across ML operations.
  • Define and promote ML lifecycle best practices, documentation standards, and performance metrics.

Skills & Qualifications

Technical Expertise

  • Strong experience with ML Ops tools/frameworks: Kubeflow, MLflow, Airflow, TensorFlow Serving, TorchServe, Amazon SageMaker, Azure ML, etc.
  • Proficiency in cloud platforms (AWS, Google Cloud Platform, Azure) and containerization and orchestration technologies (Docker and Kubernetes).
  • Solid background in Infrastructure as Code (Terraform, CloudFormation, Bicep, CDK).
  • Deep understanding of CI/CD pipelines, automation tooling, and version control systems.
  • Experience with monitoring and observability tooling (Prometheus, Grafana, Azure Monitor, etc.).

Soft & Leadership Skills

  • Demonstrated ability to lead technical teams and mentor engineering talent.
  • Excellent communication and cross-functional collaboration skills.
  • Strategic mindset with a focus on reliability, scalability, and operational excellence.

Education & Experience

  • Bachelor’s or Master’s degree in Computer Science, Software Engineering, Data Science, or a related field.
  • Typically 7+ years of experience in cloud, DevOps, or ML Ops roles; senior-level experience preferred, depending on the scale of operations.