Machine Learning Infrastructure Engineer

  • Redwood City, CA
  • Posted 13 hours ago | Updated 13 hours ago

Overview

On Site
200000 - 250000
Full Time
No Travel Required
Unable to Provide Sponsorship

Skills

Machine Learning (ML)
PyTorch
Kubernetes
GPU
Computer Science

Job Details

Position Overview
We’re looking for an experienced Machine Learning Infrastructure Engineer to help scale our ML training platform. You’ll design, build, and maintain large-scale ML infrastructure to accelerate model development and improve training performance across a growing GPU ecosystem. This role involves optimizing distributed training systems, managing high-performance computing environments, and ensuring reliability at scale.

Key Responsibilities

  • Infrastructure & Scalability: Design and implement large-scale ML training pipelines leveraging parallel GPU processing on Google Cloud Platform or AWS.

  • Performance & Distributed Systems: Optimize high-performance computing resources, address distributed-systems challenges (e.g., race conditions, memory constraints), and improve training efficiency using techniques such as mixed precision, ZeRO, and LoRA.

  • Job Scheduling & Reliability: Develop job scheduling, retries, and recovery systems to improve uptime and resource utilization.

  • Storage & Data Handling: Implement optimized local and networked storage solutions and caching strategies to maximize data throughput.

  • Collaboration: Partner with ML researchers and data scientists to identify bottlenecks, monitor system performance, and continuously improve scalability.

Required Qualifications

  • Bachelor’s degree in Computer Science or a related field.

  • 7+ years of software experience, including 2+ years in a technical leadership role.

  • Proven expertise in distributed systems and high-performance ML infrastructure.

  • Hands-on experience with GPU cloud environments (AWS, Google Cloud Platform), job scheduling, and performance tuning.

  • Familiarity with PyTorch, TensorRT, Triton, and related ML frameworks.

  • Strong problem-solving and communication skills.

Preferred Qualifications

  • Experience with Kubernetes or infrastructure-as-code frameworks.
