Job Details
Summary
We're looking for a Platform Engineer to design, build, and optimize scalable distributed compute infrastructure using Ray. This role focuses on enabling advanced ML, analytics, and data processing workloads across our Iceberg-based lakehouse.
Key Responsibilities
Design and implement distributed compute infrastructure using Ray to support large-scale machine learning and data processing workloads.
Develop and optimize scalable training, inference, and batch processing pipelines integrated with the lakehouse.
Work closely with data scientists and platform teams to provide high-performance, cost-optimized compute capabilities.
Implement autoscaling, resource management, and job orchestration patterns for Ray clusters.
Contribute to integration with other components (e.g., Trino, Airflow, Iceberg) for seamless data access and processing.
Required Skills
5+ years of experience in distributed systems, ML infrastructure, or data engineering roles.
2+ years of hands-on experience with Ray in production environments.
Strong background in Python, distributed compute frameworks, and model deployment strategies.
Familiarity with data lakehouse architectures and integration with storage/query engines.
Experience with Kubernetes, container orchestration, and autoscaling strategies.
Understanding of MLOps concepts and CI/CD for ML pipelines.