We are partnering with a fast-growing AI startup building next-generation multimodal generative systems focused on highly realistic visual experiences at scale. The company operates at the intersection of computer vision, generative AI, and real-time inference infrastructure, developing advanced AI products used by enterprise customers across large consumer-facing industries.
This is a highly technical and hands-on engineering role focused on production inference optimization for multimodal and generative AI systems. The ideal candidate will have deep expertise in GPU inference, model serving, PyTorch-based deployment, and performance optimization for large-scale AI applications.
The role offers significant ownership across infrastructure, inference systems, and production model optimization, with opportunities to contribute to novel AI system design and scalable deployment architectures.
What You'll Work On
Build and optimize high-performance inference-serving systems for multimodal and generative AI models
Improve latency, throughput, scalability, and GPU utilization for production AI workloads
Productionize large PyTorch-based models for real-world deployment environments
Design and maintain model-serving microservices and distributed inference infrastructure
Optimize inference pipelines using:
TensorRT
Triton Inference Server
vLLM
CUDA/GPU acceleration techniques
Work on:
KV cache optimization
model pruning
quantization
distillation
batching strategies
memory optimization
latent-space conditioning
Deploy and scale multimodal architectures including:
diffusion models
vision-language models (VLMs)
large vision pipelines
Collaborate closely with research and product engineering teams to balance:
model quality
latency
infrastructure cost
production reliability
Own the full inference optimization lifecycle from experimentation to production deployment
Ideal Background
Strong experience building and optimizing AI inference systems in production
Deep understanding of GPU architecture and performance optimization
Hands-on expertise with:
Python
PyTorch
CUDA
TensorRT
Triton
vLLM
Experience with multimodal AI, computer vision, or generative AI systems
Familiarity with diffusion models or large-scale vision pipelines is strongly preferred
Strong understanding of model deployment tradeoffs:
throughput vs latency
memory efficiency
model quality vs compute cost
Experience working with distributed inference systems and scalable serving infrastructure
Comfortable operating in highly autonomous, fast-moving startup environments
Nice to Have
Experience with:
diffusion model optimization
multimodal transformers
quantization techniques
FlashAttention
TensorRT-LLM
speculative decoding
model parallelism
Kubernetes-based ML infrastructure
Contributions to open source AI infrastructure projects
Publications, patents, or research experience in AI systems, vision, or generative modeling
Why This Opportunity
Work on cutting-edge multimodal and generative AI systems deployed at scale
Significant ownership and autonomy across core AI infrastructure
Opportunity to solve complex GPU inference and scaling challenges
High-impact engineering role with direct visibility into product performance
Fast-moving environment with strong technical talent density
Opportunity to contribute to novel IP and patentable systems
80% covered healthcare, 401(k) with 3% matching, $500 learning stipend, and a global program: work anywhere in the world for 3 months
Oscar Associates Limited (US) is acting as an Employment Agency in relation to this vacancy.