MLOps / LLM Ops Engineer

  • New York, NY
  • Posted 31 days ago | Updated 31 days ago

Overview

Hybrid
$120,000 - $160,000
Full Time

Skills

MLOps
Cloud
LLM

Job Details

Location: NYC, NY (Hybrid)

Job Overview

We are looking for an experienced MLOps / LLM Ops Engineer with a strong background in deploying and managing machine learning (ML) and large language model (LLM) pipelines. The ideal candidate will have 7+ years of experience in MLOps and expertise in building end-to-end ML/LLM pipelines with open-source tools and cloud-native services on AWS, Google Cloud Platform, and Azure. This role requires hands-on experience deploying, automating, and monitoring ML/LLM workflows, along with a solid grounding in DevOps practices to keep CI/CD processes reliable.

Key Responsibilities

  • Pipeline Design & Implementation:
    • Design, build, and manage MLOps and LLM Ops pipelines for data ingestion, model training, validation, deployment, and monitoring.
    • Use open-source tools such as MLflow, Kubeflow, DVC, and Airflow to automate and monitor machine learning workflows.
    • Implement scalable LLM-specific solutions for model training and inference, optimizing for resource allocation and deployment efficiency.
  • Cloud-native MLOps Implementation:
    • Set up and manage MLOps pipelines on AWS (SageMaker, EKS, Lambda, S3), Google Cloud Platform (Vertex AI, AI Platform Pipelines), and Azure (Azure Machine Learning, AKS, Azure Functions).
    • Design model versioning, retraining, and deployment workflows on cloud platforms to ensure consistent performance and availability.
    • Implement CI/CD pipelines for ML models with GitHub Actions, Jenkins, or GitLab CI.
  • Model Monitoring & Performance Optimization:
    • Monitor models in production using Prometheus, Grafana, and TensorBoard, establishing observability metrics for model drift, accuracy, and latency.
    • Collaborate with Data Engineering and ML teams to implement scalable and efficient data pipelines using Spark, Apache Beam, or BigQuery.
    • Use A/B testing and shadow deployment strategies to validate and optimize LLM performance in real time.
  • LLM-specific Model Operations:
    • Deploy and monitor LLMs for specific tasks, ensuring they adhere to performance SLAs and are optimized for cost.
    • Develop techniques for fine-tuning, optimizing inference, and managing infrastructure costs for large language models.

Required Skills and Qualifications

  • Experience:
    • 7+ years of experience in MLOps, DevOps, or ML Engineering.
    • Strong track record of deploying and managing large-scale ML models and LLMs in production environments.
  • Technical Skills:
    • Proficiency with Kubernetes and Docker for container orchestration and model deployment.
    • Experience with open-source MLOps tools (MLflow, Kubeflow, DVC) and data versioning.
    • Hands-on experience with cloud-native ML services on AWS, Google Cloud Platform, or Azure.
    • Knowledge of Python or Bash scripting for automating processes and custom integrations.
  • DevOps-Related Skills:
    • Solid understanding of CI/CD practices and tools such as GitHub Actions, Jenkins, or GitLab CI/CD for building and deploying ML models and LLMs.
    • Proficiency with infrastructure-as-code tools such as Terraform or Ansible for automated provisioning and configuration management.
  • ML/LLM Specific Knowledge:
    • Familiarity with the LLM lifecycle, including fine-tuning, tokenization, model serving, and large-scale NLP.
    • Knowledge of transformers and other deep learning architectures for NLP/LLM tasks.

Behavioral Skills

  • Excellent communication skills to convey complex technical details to both technical and non-technical audiences.
  • Proactive problem-solving attitude, with the ability to adapt to emerging MLOps and LLMOps best practices.
  • Ability to work independently and collaborate effectively with cross-functional teams, including data science, engineering, and DevOps.

Educational Qualifications

  • Bachelor's or Master's degree in Computer Science, Data Science, AI/ML, or a related field.

Certifications in cloud platforms (AWS, Google Cloud Platform, Azure) or MLOps frameworks are a plus.
