Machine Learning Infrastructure Engineer

San Jose, CA, US • Posted 5 hours ago • Updated 5 hours ago

Full Time

On-site

USD $140,000.00 - 165,000.00 per year

Job Details

Skills

Ethernet
PCI Express
Semiconductors
Cosmos
Recruiting
Backbone.js
Systems Modeling
SAFE
Evaluation
Regression Analysis
Dashboard
Economics
Caching
Failover
Optimization
Software Engineering
Machine Learning Operations (ML Ops)
Python
Amazon Web Services
Google Cloud Platform
Google Cloud
Machine Learning (ML)
Reliability Engineering
Incident Management
Workflow
Debugging
Routing
Access Control
Artificial Intelligence
Innovation

Summary

Astera Labs (NASDAQ: ALAB) provides rack-scale AI infrastructure through purpose-built connectivity solutions. By collaborating with hyperscalers and ecosystem partners, Astera Labs enables organizations to unlock the full potential of modern AI. Astera Labs' Intelligent Connectivity Platform integrates CXL, Ethernet, NVLink, PCIe, and UALink semiconductor-based technologies with the company's COSMOS software suite to unify diverse components into cohesive, flexible systems that deliver end-to-end scale-up and scale-out connectivity. The company's custom connectivity solutions business complements its standards-based portfolio, enabling customers to deploy tailored architectures to meet their unique infrastructure requirements.

Machine Learning Infrastructure Engineer

Location: San Jose, CA
Experience: 1-5 years
Team: Applied AI

The role

We're hiring a Machine Learning Infrastructure Engineer to build the runtime, platform, and operational backbone for modern AI systems. This role is for someone who wants to work on the systems behind the systems: model access layers, routing, serving paths, telemetry, observability, evaluation infrastructure, and the controls needed to make fast-moving AI work reliable in practice.

This is a platform role, but not in the old sense. The work is tightly coupled to how modern AI systems are actually built and used: multiple model providers, agent runtimes, skill and tool layers, inference telemetry, cost-aware routing, AI spend visibility, and governance that is strong enough for real internal adoption.

What you'll do

Build and improve internal AI infrastructure for LLM applications, agents, retrieval systems, and model-backed engineering workflows.
Own inference deployment paths across managed and self-serve environments, including access control, monitoring, and operational reliability.
Build platform layers such as model gateways, routing, runtime integrations, telemetry, and controls for safe execution at scale.
Develop AI Ops capabilities across evaluation, release readiness, observability, incident triage, regression detection, and cost monitoring.
Build dashboards, tracing, logging, and alerting for production AI systems, including spend and usage visibility across tools and teams.
Improve performance and unit economics through routing, caching, batching, failover, and latency/cost optimization.
Create reusable APIs, SDKs, and platform abstractions that make AI systems easier to deploy, evaluate, govern, and operate.

What we're looking for

1-5 years of experience in software engineering, ML infrastructure, MLOps, platform engineering, or related backend/infrastructure roles.
Strong Python plus strong systems instincts.
Experience with AWS or Google Cloud Platform and real production service ownership.
Familiarity with inference deployments, model APIs, gateways, serving systems, or runtime infrastructure for LLM/ML workloads.
Experience with observability, telemetry, reliability engineering, and incident response.
Understanding of eval systems, release workflows, retrieval-backed systems, and debugging non-deterministic AI behavior.
Ability to translate messy platform needs into scalable internal infrastructure.

What strong candidates often look like

They have built or operated systems where latency, routing, cost, telemetry, and reliability actually matter. They understand that modern AI infrastructure is not just about getting a model endpoint running. It is about building the runtime, visibility, controls, and developer experience that let an applied AI team move fast without losing quality or trust.

Why this role is interesting

The team is building AI-ready infrastructure in the most literal sense: observability, access control, AI spend tracking, secure managed platforms, skill/tool infrastructure, and telemetry that spans requests, tools, models, and outcomes. If you want to work on the platform layer that makes modern agentic systems possible - and do it in a setting where the downstream users are serious engineers with high expectations - this is that role.

The base pay range for this role is $140,000 to $165,000 per year.

We know that creativity and innovation happen more often when teams include diverse ideas, backgrounds, and experiences, and we actively encourage everyone with relevant experience to apply, including people of color, LGBTQ+ and non-binary people, veterans, parents, and individuals with disabilities.

Employers have access to artificial intelligence language tools (“AI”) that help generate and enhance job descriptions and AI may have been used to create this description. The position description has been reviewed for accuracy and Dice believes it to correctly reflect the job opportunity.

Dice Id: 91133767
Position Id: d15dee3c3a333fa3e126a0054cc420d