Full Stack Software Engineer - ML Compute Capacity

Santa Clara, CA, US • Posted 30+ days ago • Updated 4 hours ago
Full Time
On-site

Job Details

Skills

  • High Performance Computing
  • Collaboration
  • Optimization
  • Algorithms
  • Management
  • Distributed Computing
  • Knowledge Sharing
  • Python
  • Data Engineering
  • PostgreSQL
  • Elasticsearch
  • RESTful
  • Cloud Storage
  • React.js
  • Grafana
  • Conflict Resolution
  • Problem Solving
  • Mathematics
  • Economics
  • Kubernetes
  • Scheduling
  • Resource Management
  • Debugging
  • Machine Learning (ML)
  • Training
  • Capacity Management

Summary

Scaling machine learning workloads across thousands of accelerators creates challenges that few engineers ever encounter. In Apple's Machine Learning Platform Technologies organization, we build the infrastructure that powers large-scale ML training and inference workloads, bringing together expertise in distributed systems, machine learning infrastructure, and high-performance computing.

Description

As a senior engineer on the ML Compute Capacity team, you will design, build, and operate the production systems that ensure compute resources are optimally distributed throughout the company. You'll work across the stack - from data pipelines and backend services to APIs and interactive frontends - developing telemetry systems, optimization algorithms, policies, and intuitive tools for managing demand and improving efficiency across Apple's largest accelerator fleet. Our small, nimble team works in a high-autonomy, fast-paced environment, and we're passionate about digging into data patterns, laying out the performance characteristics of an entire distributed system, and knowledge sharing. If the opportunity to own and operate services that scale, stay highly available, and "just work" excites you, then please reach out to us!

Minimum Qualifications

5+ years of experience in relevant areas

Proficiency in Python for production backend and data engineering work

Experience building data pipelines and crafting robust queries over large-scale, multi-source data (e.g., Trino, PostgreSQL, Elasticsearch)

Experience designing and building RESTful APIs and working with cloud storage technologies

Experience with modern web frameworks like React

Experience with observability tools (e.g., Prometheus, Grafana) or equivalent monitoring systems

Excellent problem-framing and problem-solving skills

Strong CS fundamentals

Bachelor's degree or higher in Engineering, Mathematics, Economics, or a related quantitative field

Preferred Qualifications

Experience operating Kubernetes at production scale - including scheduling, resource management, and cluster debugging

Familiarity with accelerator utilization patterns across ML training and inference

Strong interest in capacity planning, cost attribution, or FinOps systems
  • Dice Id: 90733111
  • Position Id: e1d74a1f5d721a5d39a2ceba40022280