Senior ML Infrastructure Engineer (PyTorch, Kubernetes, GPU Training)

Redwood City, CA, US • Posted 5 hours ago • Updated 5 hours ago
Full Time
No Travel Required
On-site
$250,000 - $320,000/yr

Job Details

Skills

  • PyTorch
  • Kubernetes
  • GPU

Summary

Short Job Description

We are seeking a Senior ML Infrastructure Engineer to design and scale the infrastructure powering large-scale machine learning training workloads. In this role, you'll build high-performance GPU training platforms, optimize distributed training pipelines, and improve the developer experience for ML researchers.

Responsibilities:

  • Design and scale distributed ML training infrastructure for large GPU clusters.
  • Build and optimize training pipelines using PyTorch, DeepSpeed, and other distributed training frameworks.
  • Develop and maintain job scheduling systems using Kubernetes and/or SLURM.
  • Create high-throughput data pipelines for large-scale multimodal datasets.
  • Optimize GPU utilization, memory efficiency, and overall system performance.
  • Build low-latency inference pipelines for production ML deployments.

Required Skills:

  • 7+ years of experience in ML Infrastructure, HPC, or Distributed Systems.
  • Strong experience with PyTorch, DeepSpeed, FSDP, ZeRO, or similar distributed training frameworks and techniques.
  • Hands-on experience with Kubernetes, cloud platforms (AWS/Google Cloud Platform), and containerized environments.
  • Strong understanding of distributed systems, GPU optimization, NCCL, memory management, and performance tuning.
  • Experience building scalable ML infrastructure from development through production.

Location: Redwood City, CA (On-site)
Employment Type: Full-Time

Nice to Have:

  • Experience with multimodal AI, robotics data pipelines, Triton, TensorRT, custom ML kernels, or ML compiler/runtime optimization.

Dice Id: 91172168
Position Id: 127-44139-