ML Platform Engineer

Overview

Remote
Depends on Experience
Full Time
10% Travel

Skills

Analytics
Artificial Intelligence
Batch Processing
Continuous Delivery
Continuous Integration
Data Engineering
Data Processing
Kubernetes
Machine Learning (ML)
Machine Learning Operations (MLOps)
Python
Orchestration
Resource Management
Storage

Job Details

Summary

We're looking for a Platform Engineer to design, build, and optimize scalable distributed compute infrastructure using Ray. This role focuses on enabling advanced ML, analytics, and data processing workloads across our Iceberg-based lakehouse.

Key Responsibilities

  • Design and implement distributed compute infrastructure using Ray to support large-scale machine learning and data processing workloads.

  • Develop and optimize scalable training, inference, and batch processing pipelines integrated with the lakehouse.

  • Work closely with data scientists and platform teams to provide high-performance, cost-optimized compute capabilities.

  • Implement autoscaling, resource management, and job orchestration patterns for Ray clusters.

  • Contribute to integration with other components (e.g., Trino, Airflow, Iceberg) for seamless data access and processing.

Required Skills

  • 5+ years in distributed systems, ML infrastructure, or data engineering roles.

  • 2+ years of hands-on experience with Ray in production environments.

  • Strong background in Python, distributed compute frameworks, and model deployment strategies.

  • Familiarity with data lakehouse architectures and integration with storage/query engines.

  • Experience with Kubernetes, container orchestration, and autoscaling strategies.

  • Understanding of MLOps concepts and CI/CD for ML pipelines.
