Machine Learning Inference Engineer

San Francisco, CA, US • Posted 20 hours ago • Updated 56 minutes ago

Full Time

On-site

$200,000 - $210,000 per annum

Job Details

Summary

We are partnering with a fast-growing AI startup building next-generation multimodal generative systems focused on highly realistic visual experiences at scale. The company operates at the intersection of computer vision, generative AI, and real-time inference infrastructure, developing advanced AI products used by enterprise customers across large consumer-facing industries.

This is a highly technical and hands-on engineering role focused on production inference optimization for multimodal and generative AI systems. The ideal candidate will have deep expertise in GPU inference, model serving, PyTorch-based deployment, and performance optimization for large-scale AI applications.

The role offers significant ownership across infrastructure, inference systems, and production model optimization, with opportunities to contribute to novel AI system design and scalable deployment architectures.

What You'll Work On
Build and optimize high-performance inference-serving systems for multimodal and generative AI models
Improve latency, throughput, scalability, and GPU utilization for production AI workloads
Productionize large PyTorch-based models for real-world deployment environments
Design and maintain model-serving microservices and distributed inference infrastructure
Optimize inference pipelines using TensorRT, Triton Inference Server, vLLM, and CUDA/GPU acceleration techniques
Work on KV cache optimization, model pruning, quantization, distillation, batching strategies, memory optimization, and latent-space conditioning
Deploy and scale multimodal architectures, including diffusion models, vision-language models (VLMs), and large vision pipelines
Collaborate closely with research and product engineering teams to balance model quality, latency, infrastructure cost, and production reliability
Own the full inference optimization lifecycle from experimentation to production deployment

Ideal Background
Strong experience building and optimizing AI inference systems in production
Deep understanding of GPU architecture and performance optimization
Hands-on expertise with Python, PyTorch, CUDA, TensorRT, Triton, and vLLM
Experience with multimodal AI, computer vision, or generative AI systems
Familiarity with diffusion models or large-scale vision pipelines is strongly preferred
Strong understanding of model deployment tradeoffs: throughput vs. latency, memory efficiency, and model quality vs. compute cost
Experience working with distributed inference systems and scalable serving infrastructure
Comfortable operating in highly autonomous, fast-moving startup environments

Nice to Have
Experience with diffusion model optimization, multimodal transformers, quantization techniques, FlashAttention, TensorRT-LLM, speculative decoding, model parallelism, and Kubernetes-based ML infrastructure
Contributions to open-source AI infrastructure projects
Publications, patents, or research experience in AI systems, vision, or generative modeling

Why This Opportunity
Work on cutting-edge multimodal and generative AI systems deployed at scale
Significant ownership and autonomy across core AI infrastructure
Opportunity to solve complex GPU inference and scaling challenges
High-impact engineering role with direct visibility into product performance
Fast-moving environment with strong technical talent density
Opportunity to contribute to novel IP and patentable systems

Benefits: 80% covered healthcare, 401(k) with 3% matching, $500 learning stipend, and a global program (work anywhere in the world for 3 months).

Oscar Associates Limited (US) is acting as an Employment Agency in relation to this vacancy.

Employers have access to artificial intelligence language tools (“AI”) that help generate and enhance job descriptions and AI may have been used to create this description. The position description has been reviewed for accuracy and Dice believes it to correctly reflect the job opportunity.

Dice Id: 10525742
Position Id: 0007689-830
Contact the job poster

Andrea Alexander

Recruiter @ Oscar Technology
