AI Inference Engineer

San Jose, CA, US • Posted 3 days ago • Updated 3 days ago

Contract W2

6 Months

No Travel Required

On-site

Depends on Experience

Job Details

Skills

LLM Inference
Generative AI
Large Language Models (LLMs)
AI Inference Engineering
vLLM
SGLang
Triton Inference Server
TensorRT-LLM
TorchServe
KServe
CUDA
ROCm
GPU Optimization
GPU Kernel Development
Python
C++
Rust
Kubernetes
OpenShift
Helm
Distributed Systems
Multi-GPU Clusters
Tensor Parallelism
Pipeline Parallelism
Model Serving
AI Infrastructure
MLOps
Performance Tuning
Latency Optimization
Throughput Optimization
KV Cache Management
PagedAttention
Quantization
Continuous Batching
Speculative Decoding
Mixture of Experts (MoE)
LLM Serving
Distributed Computing
NVIDIA GPUs
AI Platform Engineering
OpenAI API
Observability
Telemetry
Benchmarking
Profiling
Microservices
Cloud Infrastructure
Production AI Systems

Summary

AI Inference Engineer

Location: San Jose, CA

Contract / C2H

Duration: 6-12 Months

About the Role

We are seeking a highly skilled AI Inference Engineer to join our team and drive the performance, scalability, and reliability of our large-scale model serving infrastructure. This role sits at the intersection of systems engineering, GPU optimization, and distributed infrastructure, and is ideal for someone who thrives on squeezing maximum performance out of production AI workloads.

The ideal candidate has hands-on experience building or operating production-grade inference serving systems and is comfortable working close to the hardware, from CUDA/ROCm kernels to distributed multi-node, multi-GPU clusters serving large language models at scale.

Key Responsibilities

Inference Serving & Optimization

· Build, operate, and optimize production model-serving stacks using frameworks such as vLLM, SGLang, Triton Inference Server, TensorRT-LLM, TorchServe, or KServe

· Develop and maintain custom high-throughput microservices for model inference using C++, Python, and Rust

GPU & Hardware Acceleration

· Write and optimize custom GPU kernels using CUDA, ROCm, or Triton (the GPU kernel language, as distinct from Triton Inference Server)

· Apply deep understanding of GPU architecture, including memory hierarchies and tensor cores, to improve compute efficiency

LLM Inference Internals

· Optimize prefill and decode stages, attention mechanisms, and continuous batching

· Implement and tune quantization, speculative decoding, tensor parallelism, pipeline parallelism, and Mixture of Experts (MoE) serving strategies

Memory & KV Cache Management

· Design and implement KV cache optimization strategies, including PagedAttention, chunked prefill, prefix caching, and quantized KV

· Develop cache transfer and offload strategies to manage memory pressure under high-volume, irregular workloads

Distributed Systems & Infrastructure

· Build and operate fault-tolerant, high-concurrency serving systems deployed on Kubernetes, OpenShift, or similar orchestration platforms, packaged and released with Helm or comparable tooling

· Implement tensor parallelism, pipeline parallelism, and distributed computing across multi-node, multi-GPU clusters

Distributed Serving Platform (Dynamo)

· Contribute to distributed serving architecture components including frontend, router, worker discovery, multi-model routing, and health checks

· Build and maintain OpenAI-compatible endpoints across multiple backends, including SGLang, TensorRT-LLM, and vLLM

Performance & Reliability

· Conduct deep profiling and benchmarking to identify and resolve latency and throughput regressions

· Build telemetry-driven observability platforms that support high availability, load balancing, and dynamic request scheduling

Model Support

· Bring up and support a broad range of model classes in production, including decoder-only LLMs, MoE models, hybrid attention/SSM models, multimodal models, embedding models, reward models, and classification models

Required Qualifications

· Proven experience with production model-serving frameworks (vLLM, SGLang, Triton Inference Server, TensorRT-LLM, TorchServe, KServe, or custom runtimes)

· Strong proficiency in C++, Python, and Rust for building high-performance, memory-efficient systems

· Hands-on experience writing GPU kernels using CUDA and/or ROCm

· Solid understanding of LLM inference internals, including attention mechanisms, KV cache management, continuous batching, and quantization

· Experience with distributed, multi-node, multi-GPU serving environments

· Experience deploying and managing services on Kubernetes, OpenShift, or similar orchestration platforms

· Strong background in performance profiling, benchmarking, and debugging latency or throughput issues

Preferred Qualifications

· Direct experience working with NVIDIA Dynamo or similar distributed serving architectures (router, worker discovery, multi-model routing)

· Experience supporting diverse model types in production, including MoE, multimodal, and hybrid attention/SSM architectures

· Familiarity with OpenAI-compatible API design and implementation

· Experience with telemetry and observability tooling for large-scale GPU infrastructure

Employers have access to artificial intelligence language tools (“AI”) that help generate and enhance job descriptions and AI may have been used to create this description. The position description has been reviewed for accuracy and Dice believes it to correctly reflect the job opportunity.

Dice Id: 10271950
Position Id: 8995339

Contact the job poster

Kumar Swentak

Account Manager @ Triune Infomatics Inc
