Hybrid in Charlotte, North Carolina
27d ago
vLLM, TensorRT-LLM, Triton Inference Server, SGLang
Inference optimization techniques: Continuous batching, Speculative decoding, KV cache / Prefix caching
Model optimization: FP8, AWQ, GPTQ
Distributed & GPU Systems: Tensor parallelism and large model scaling; CUDA, NCCL, GPU architecture; GPU partitioning & optimization (MIG)
Kubernetes & ML Serving: Kubernetes-based ML serving platforms; KServe, OpenShift AI; Helm charts, Operators, platform automation
GPU Orchestration: Run:AI or similar GPU scheduling/orchestra
Full-time
Depends on Experience
