GPU Inference Engineer

Overview

Remote
$45 - $50
Contract - W2
Contract - 6 Month(s)

Skills

GPU
Docker
Kubernetes
Data Storage
Authentication
Large Language Models (LLMs)

Job Details

Role - GPU Inference Engineer (Need 100% quality resumes)

Location - US Remote

Dedicated Inference Service
We are now looking for developers with general cloud services / distributed services experience, with LLM experience as a secondary skill.

Required Skills
Deep experience building services in modern cloud environments on distributed systems (e.g., containerization with Kubernetes and Docker, infrastructure as code, CI/CD pipelines, APIs, authentication and authorization, data storage, deployment, logging, monitoring, and alerting)
Experience working with Large Language Models (LLMs), particularly hosting them to run inference
Strong verbal and written communication skills. Your job will involve communicating with local and remote colleagues about technical subjects and writing detailed documentation.

Preferred Skills


  • Experience building or using benchmarking tools to evaluate LLM inference across various model, engine, and GPU combinations.
  • Familiarity with LLM performance metrics such as prefill throughput, decode throughput, time per output token (TPOT), and time to first token (TTFT).
  • Experience with one or more inference engines, e.g., vLLM, SGLang, or Modular MAX.
  • Familiarity with one or more distributed inference serving frameworks, e.g., llm-d, NVIDIA Dynamo, or Ray Serve.
  • Experience with AMD and NVIDIA GPUs, using software such as CUDA, ROCm, AITER, NCCL, and RCCL.
  • Knowledge of distributed inference optimization techniques such as tensor and data parallelism, KV cache optimizations, and smart routing.

What You'll Be Working On:


  • Develop and maintain an inference platform for serving large language models, optimized for each GPU platform they run on.
  • Work on complex AI and cloud engineering projects through the entire product development lifecycle (PDLC): ideation, product definition, experimentation, prototyping, development, testing, release, and operations.
  • Build tooling and observability to monitor system health, and build auto-tuning capabilities.
  • Build benchmarking frameworks that test model serving performance and guide system and infrastructure tuning efforts.
  • Build native cross-platform inference support across NVIDIA and AMD GPUs for a variety of model architectures.
