ML Ops Lead

  • New York, NY
  • Posted 6 hours ago | Updated 6 hours ago

Overview

Hybrid
Depends on Experience
Full Time
Accepts corp to corp applications
Able to Provide Sponsorship

Skills

mlops
ml ops
DevOps
Machine Learning Operations (ML Ops)
Python
TensorFlow
Terraform
Workflow
aws
gcp
Machine Learning (ML)
Kubernetes

Job Details

Job Title: ML Ops Lead
Location: New York (Hybrid)

Job Description

The ML Ops Lead drives the design, deployment, and optimization of machine learning solutions, balancing hands-on engineering with strategic leadership to enable robust, scalable, and maintainable AI infrastructure.

Key Responsibilities

  • Architect and maintain scalable ML infrastructure, self-service ML pipelines, and CI/CD workflows for model training and deployment.
  • Lead and mentor an MLOps team, fostering technical excellence and continual improvement.
  • Design high-scale distributed training and inference environments using cloud (AWS, Google Cloud Platform) and on-premises resources.
  • Build and manage feature stores, data ingestion, preprocessing, and validation pipelines.
  • Implement A/B testing, canary releases, monitoring, and rollback mechanisms for production ML models.
  • Ensure compliance with data governance, privacy, and security standards; manage role-based access controls for ML infrastructure.
  • Collaborate with data scientists, software engineers, DevOps, and product teams to bring models from experimentation to enterprise-grade production.

Required Skills and Experience

  • Deep expertise in creating and managing machine learning infrastructure and orchestration frameworks (e.g., Kubeflow, MLflow, Airflow).
  • Proficiency in cloud platforms (AWS, Google Cloud Platform), Kubernetes, Terraform, and distributed computing.
  • Working knowledge of Databricks MLflow.
  • Excellent skills in Python, ML frameworks and serving tools (TensorFlow, TorchServe), CI/CD automation, and pipeline management.
  • Strong analytical, problem-solving, and project management abilities.
  • Demonstrated ability to build, scale, and lead technical teams.
  • Solid understanding of data compliance, governance, and model monitoring.
  • Master's degree in a technical field (Computer Science, Data Science, ML, or equivalent).

Desired Qualifications

  • Experience optimizing GPU/TPU utilization and large-scale storage solutions.
  • Track record in designing robust monitoring systems for model drift, downtime, and performance.
  • Familiarity with the challenges of deploying models in real-time, multi-cloud, or edge environments.
  • Ability to innovate and continuously improve workflows, combining ML and human computation.