MLOps / LLM Ops Engineer
Location: NYC, NY (Hybrid)
Job Overview
We are looking for an experienced MLOps / LLM Ops Engineer with a strong background in deploying and managing machine learning (ML) and large language model (LLM) pipelines. The ideal candidate will have 7+ years of experience in MLOps, with expertise in setting up end-to-end ML/LLM pipelines using open-source tools and cloud-native solutions on platforms such as AWS, Google Cloud Platform, and Azure. This role requires hands-on experience deploying, automating, and monitoring ML/LLM workflows, with a solid grounding in DevOps practices to ensure seamless CI/CD processes.
Key Responsibilities
- Pipeline Design & Implementation:
- Design, build, and manage MLOps and LLM Ops pipelines for data ingestion, model training, validation, deployment, and monitoring.
- Use open-source tools such as MLflow, Kubeflow, DVC, and Airflow to automate and monitor machine learning workflows.
- Implement scalable LLM-specific solutions for model training and inference, optimizing for resource allocation and deployment efficiency.
- Cloud-native MLOps Implementation:
- Set up and manage MLOps pipelines in AWS (SageMaker, EKS, Lambda, S3), Google Cloud Platform (Vertex AI, AI Platform Pipelines), and Azure (Machine Learning, AKS, Azure Functions).
- Design model versioning, retraining, and deployment workflows on cloud platforms to ensure consistent performance and availability.
- Implement CI/CD pipelines for ML models with GitHub Actions, Jenkins, or GitLab CI.
- Model Monitoring & Performance Optimization:
- Monitor models in production using Prometheus, Grafana, and TensorBoard, establishing observability metrics for model drift, accuracy, and latency.
- Collaborate with Data Engineering and ML teams to implement scalable and efficient data pipelines using Spark, Apache Beam, or BigQuery.
- Use A/B testing and shadow deployment strategies to validate and optimize LLM performance in real time.
- LLM-specific Model Operations:
- Deploy and monitor LLMs for specific tasks, ensuring they adhere to performance SLAs and are optimized for cost.
- Develop techniques for fine-tuning, optimizing inference, and managing infrastructure costs for large-scale LLMs.
Required Skills and Qualifications
- Experience:
- 7+ years of experience in MLOps, DevOps, or ML Engineering.
- Strong track record in deploying and managing large-scale ML models and LLMs in production environments.
- Technical Skills:
- Proficiency with Kubernetes and Docker for container orchestration and model deployment.
- Experience with open-source MLOps tools (MLflow, Kubeflow, DVC) and data versioning.
- Hands-on experience with cloud-native ML tools in AWS, Google Cloud Platform, or Azure and associated ML services.
- Knowledge of Python or Bash scripting for automating processes and custom integrations.
- DevOps-Related Skills:
- Solid understanding of CI/CD practices and tools such as GitHub Actions, Jenkins, or GitLab CI/CD to build and deploy ML models and LLMs.
- Proficiency with infrastructure-as-code tools, such as Terraform or Ansible, to enable automated provisioning and configuration management.
- ML/LLM Specific Knowledge:
- Familiarity with the LLM lifecycle, including fine-tuning, tokenization, model serving, and large-scale NLP.
- Knowledge of transformers and other deep learning architectures for NLP/LLM tasks.
Behavioral Skills
- Excellent communication skills to convey complex technical details to both technical and non-technical audiences.
- Proactive problem-solving attitude, with the ability to adapt to emerging MLOps and LLMOps best practices.
- Ability to work independently and collaborate effectively with cross-functional teams, including data science, engineering, and DevOps.
Educational Qualifications
- Bachelor's or Master's degree in Computer Science, Data Science, AI/ML, or a related field.
- Certifications in cloud platforms (AWS, Google Cloud Platform, Azure) or MLOps frameworks are a plus.