Machine Learning Inference Engineer

San Francisco, CA, US • Posted 20 hours ago • Updated 56 minutes ago
Full Time
On-site
$200,000 - $210,000 per annum

Job Details

Skills

  • Machine Learning Inference Engineer

Summary



We are partnering with a fast-growing AI startup building next-generation multimodal generative systems focused on highly realistic visual experiences at scale. The company operates at the intersection of computer vision, generative AI, and real-time inference infrastructure, developing advanced AI products used by enterprise customers across large consumer-facing industries.


This is a highly technical and hands-on engineering role focused on production inference optimization for multimodal and generative AI systems. The ideal candidate will have deep expertise in GPU inference, model serving, PyTorch-based deployment, and performance optimization for large-scale AI applications.


The role offers significant ownership across infrastructure, inference systems, and production model optimization, with opportunities to contribute to novel AI system design and scalable deployment architectures.


What You'll Work On
  • Build and optimize high-performance inference-serving systems for multimodal and generative AI models
  • Improve latency, throughput, scalability, and GPU utilization for production AI workloads
  • Productionize large PyTorch-based models for real-world deployment environments
  • Design and maintain model-serving microservices and distributed inference infrastructure
  • Optimize inference pipelines using:
      • TensorRT
      • Triton Inference Server
      • vLLM
      • CUDA/GPU acceleration techniques
  • Work on:
      • KV cache optimization
      • model pruning
      • quantization
      • distillation
      • batching strategies
      • memory optimization
      • latent-space conditioning
  • Deploy and scale multimodal architectures including:
      • diffusion models
      • vision-language models (VLMs)
      • large vision pipelines
  • Collaborate closely with research and product engineering teams to balance:
      • model quality
      • latency
      • infrastructure cost
      • production reliability
  • Own the full inference optimization lifecycle from experimentation to production deployment


Ideal Background
  • Strong experience building and optimizing AI inference systems in production
  • Deep understanding of GPU architecture and performance optimization
  • Hands-on expertise with:
      • Python
      • PyTorch
      • CUDA
      • TensorRT
      • Triton
      • vLLM
  • Experience with multimodal AI, computer vision, or generative AI systems
  • Familiarity with diffusion models or large-scale vision pipelines is strongly preferred
  • Strong understanding of model deployment tradeoffs:
      • throughput vs latency
      • memory efficiency
      • model quality vs compute cost
  • Experience working with distributed inference systems and scalable serving infrastructure
  • Comfortable operating in highly autonomous, fast-moving startup environments


Nice to Have
  • Experience with:
      • diffusion model optimization
      • multimodal transformers
      • quantization techniques
      • FlashAttention
      • TensorRT-LLM
      • speculative decoding
      • model parallelism
      • Kubernetes-based ML infrastructure
  • Contributions to open source AI infrastructure projects
  • Publications, patents, or research experience in AI systems, vision, or generative modeling


Why This Opportunity
  • Work on cutting-edge multimodal and generative AI systems deployed at scale
  • Significant ownership and autonomy across core AI infrastructure
  • Opportunity to solve complex GPU inference and scaling challenges
  • High-impact engineering role with direct visibility into product performance
  • Fast-moving environment with strong technical talent density
  • Opportunity to contribute to novel IP and patentable systems



Benefits: 80% covered healthcare, 401(k) with 3% matching, $500 learning stipend, and a global program (work anywhere in the world for 3 months).



Oscar Associates Limited (US) is acting as an Employment Agency in relation to this vacancy.

Employers have access to artificial intelligence language tools (“AI”) that help generate and enhance job descriptions and AI may have been used to create this description. The position description has been reviewed for accuracy and Dice believes it to correctly reflect the job opportunity.
  • Dice Id: 10525742
  • Position Id: 0007689-830
Contact the job poster
Andrea Alexander

Recruiter @ Oscar Technology