ML Ops Lead

Overview

Hybrid

Depends on Experience

Full Time

Accepts corp-to-corp applications

Able to Provide Sponsorship

Skills

mlops

ml ops

DevOps

Machine Learning Operations (ML Ops)

Python

TensorFlow

Terraform

Workflow

AWS

GCP

Machine Learning (ML)

Kubernetes

Job Details

Job Title: ML Ops Lead
Location: New York (Hybrid)

Job Description

The ML Ops Lead drives the design, deployment, and optimization of machine learning solutions, balancing hands-on engineering with strategic leadership to enable robust, scalable, and maintainable AI infrastructure.

Key Responsibilities

Architect and maintain scalable ML infrastructure, self-service ML pipelines, and CI/CD workflows for model training and deployment.
Lead and mentor an MLOps team, fostering technical excellence and continual improvement.
Design high-scale distributed training and inference environments using cloud (AWS, Google Cloud Platform) and on-premises resources.
Build and manage feature stores, data ingestion, preprocessing, and validation pipelines.
Implement A/B testing, canary releases, monitoring, and rollback mechanisms for production ML models.
Ensure compliance with data governance, privacy, and security standards; manage role-based access controls for ML infrastructure.
Collaborate with data scientists, software engineers, DevOps, and product teams to bring models from experimentation to enterprise-grade production.

Required Skills and Experience

Deep expertise in creating and managing machine learning infrastructure and orchestration frameworks (e.g., Kubeflow, MLflow, Airflow).
Proficiency in cloud platforms (AWS, Google Cloud Platform), Kubernetes, Terraform, and distributed computing.
Working knowledge of Databricks MLflow.
Excellent skills in Python, ML frameworks and model-serving tools (TensorFlow, TorchServe), CI/CD automation, and pipeline management.
Strong analytical, problem-solving, and project management abilities.
Demonstrated ability to build, scale, and lead technical teams.
Solid understanding of data compliance, governance, and model monitoring.
Master's degree in a technical field (Computer Science, Data Science, ML, or equivalent).

Desired Qualifications

Experience optimizing GPU/TPU utilization and large-scale storage solutions.
Track record in designing robust monitoring systems for model drift, downtime, and performance.
Familiarity with the challenges of deploying models in real-time, multi-cloud, or edge environments.
Ability to innovate and continuously improve workflows, combining ML and human computation.

