Position Overview
We’re looking for an experienced Machine Learning Infrastructure Engineer to help scale our ML training platform. You’ll design, build, and maintain large-scale ML infrastructure to accelerate model development and improve training performance across a growing fleet of GPUs. This role involves optimizing distributed training systems, managing high-performance computing environments, and ensuring reliability at scale.
Key Responsibilities
Infrastructure & Scalability: Design and implement large-scale ML training pipelines leveraging parallel GPU processing on Google Cloud Platform or AWS.
Performance & Distributed Systems: Optimize high-performance computing resources, address distributed system challenges (e.g., race conditions, memory optimization), and enhance training efficiency using techniques like mixed precision, ZeRO, and LoRA.
Job Scheduling & Reliability: Develop job scheduling, retries, and recovery systems to improve uptime and resource utilization.
Storage & Data Handling: Implement optimized local and networked storage solutions, and caching strategies to maximize data throughput.
Collaboration: Partner with ML researchers and data scientists to identify bottlenecks, monitor system performance, and continuously improve scalability.
Required Qualifications
Bachelor’s degree in Computer Science or a related field.
7+ years of software engineering experience, including 2+ years in a technical leadership role.
Proven expertise in distributed systems and high-performance ML infrastructure.
Hands-on experience with GPU cloud environments (AWS, Google Cloud Platform), job scheduling, and performance tuning.
Familiarity with PyTorch, TensorRT, Triton, and related ML frameworks.
Strong problem-solving and communication skills.
Preferred Qualifications
Experience with Kubernetes or infrastructure-as-code frameworks.