ML Platform Engineer - GPU Infrastructure

Warren, MI, US • Posted 8 days ago • Updated 11 hours ago

Full Time

On-site

Job Details

Skills

DevOps
Systems Engineering
Computer Science
Computer Engineering
Training
Management
Storage
Scalability
Collaboration
Linux
Kubernetes
Docker
Continuous Integration
Continuous Delivery
Scripting
Python
Bash
Artificial Intelligence
Machine Learning (ML)
GPU
Performance Tuning
SIM
Cloud Computing
Amazon Web Services
Microsoft Azure
Google Cloud
Google Cloud Platform
Grafana

Summary

Job Title: ML Platform Engineer - GPU Infrastructure

Job Summary
Support the team by designing, implementing, and maintaining the automation and ML workload enablement layer of the GPU cluster platform. This role focuses on optimizing GPU compute environments for AI/ML training and Isaac Sim simulation workloads, integrating GPU jobs into CI/CD pipelines, standardizing runtime environments, and supporting reliable storage and artifact management.

Required Experience
3+ years of experience in ML Platform Engineering, DevOps, Infrastructure Engineering, or a related field
Bachelor's or Master's degree in Systems Engineering, Computer Science, Computer Engineering, or a related discipline

Responsibilities
Support GPU cluster platforms for AI/ML and simulation workloads
Optimize GPU compute environments for ML training and Isaac Sim execution
Integrate GPU workload execution into CI/CD pipelines
Standardize runtime environments using containers and automation tools
Manage storage, artifacts, and workload outputs
Troubleshoot and improve platform reliability, scalability, and performance
Collaborate with ML, infrastructure, and engineering teams

Required Skills
Experience with Linux, Kubernetes, Docker, and GPU infrastructure
Knowledge of CI/CD tools and automation scripting (Python/Bash)
Experience supporting AI/ML workloads and distributed systems
Familiarity with NVIDIA GPU technologies and containerized environments
Strong troubleshooting and performance optimization skills

Preferred Skills
Experience with Isaac Sim or simulation workloads
Exposure to cloud platforms (AWS, Azure, or Google Cloud Platform)
Knowledge of monitoring and observability tools such as Grafana or Prometheus

Dice Id: 10382565
Position Id: d423996a318b9b1b15763cd201fadb4c