Job Details
Job Description:
>> Dedicated Inference Service: looking for developers with experience building distributed services in modern cloud environments, with LLM experience as a secondary skill. GPU experience is now low on the list of preferred skills.
Requirements:
>> Deep experience building services on distributed systems in modern cloud environments (e.g., containerization (Kubernetes, Docker), infrastructure as code, CI/CD pipelines, APIs, authentication and authorization, data storage, deployment, logging, monitoring, and alerting)
>> Experience working with Large Language Models (LLMs), particularly hosting them to run inference
>> Strong verbal and written communication skills. Your job will involve communicating with local and remote colleagues about technical subjects and writing detailed documentation.
>> Experience building or using benchmarking tools for evaluating LLM inference across various model, engine, and GPU combinations.
>> Familiarity with common LLM performance metrics such as prefill throughput, decode throughput, time per output token (TPOT), and time to first token (TTFT); an illustrative sketch follows this list
>> Experience with one or more inference engines, e.g., vLLM, SGLang, or Modular Max (a minimal vLLM sketch also follows this list)
>> Familiarity with one or more distributed inference serving frameworks, e.g., llm-d, NVIDIA Dynamo, or Ray Serve
>> Experience with AMD and NVIDIA GPUs and their software stacks (e.g., CUDA, ROCm, AITER, NCCL, RCCL)
>> Knowledge of distributed inference optimization techniques, e.g., tensor and data parallelism, KV cache optimizations, and smart routing
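For illustration, a minimal Python sketch of how TTFT, TPOT, and decode throughput might be derived from per-token arrival times. The function name and the shape of the token stream are assumptions made for this example, not the API of any particular engine:

    import time

    def streaming_metrics(token_stream, request_start):
        """Compute TTFT, TPOT, and decode throughput from a token stream.

        `token_stream` is any iterable yielding decoded tokens as they
        arrive; it is a hypothetical placeholder, not a specific library's
        interface.
        """
        # Record the wall-clock arrival time of each generated token.
        arrival_times = [time.perf_counter() for _ in token_stream]

        ttft = arrival_times[0] - request_start            # time to first token (s)
        decode_window = arrival_times[-1] - arrival_times[0]
        n_decode = max(len(arrival_times) - 1, 1)
        tpot = decode_window / n_decode                    # time per output token (s)
        decode_tps = n_decode / decode_window if decode_window > 0 else float("inf")
        return ttft, tpot, decode_tps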
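Likewise, a minimal vLLM usage sketch running offline inference with tensor parallelism across two GPUs. The model name and parallelism degree are placeholders; choose values that fit the available hardware:

    from vllm import LLM, SamplingParams

    # Placeholder model and tensor_parallel_size; shards the model's
    # weights across two GPUs for tensor-parallel inference.
    llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", tensor_parallel_size=2)
    params = SamplingParams(temperature=0.0, max_tokens=64)

    outputs = llm.generate(["Summarize KV caching in one sentence."], params)
    print(outputs[0].outputs[0].text)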