LLM Infrastructure Engineer

Overview

Work arrangement: Hybrid
Compensation: Depends on Experience
Employment type: Contract - W2
Contract length: 12 Month(s)
Travel: No Travel Required

Skills

LLM
GPU
Kubernetes
NVIDIA

Job Details

Role: LLM Infrastructure Engineer
Duration: 12 months
Location: Remote

Overview:
The LLM Infrastructure Engineer will be responsible for designing, provisioning, and operating GPU-enabled infrastructure to host and scale large language models (LLMs). This role requires deep expertise in distributed systems, high-performance computing, and automation to support large-scale training and low-latency inference environments.
Key Responsibilities

  • Provision and manage GPU-enabled infrastructure and specialized hardware for training and inference workloads.
  • Design and optimize scalable systems capable of handling massive data volumes for LLM training pipelines.
  • Implement and maintain low-latency, high-throughput inference environments for serving LLMs in production.
  • Build and manage orchestration systems using Kubernetes and containerized workloads (see the GPU pod sketch after this list).
  • Automate infrastructure management with Infrastructure as Code (IaC) frameworks (e.g., Terraform, Ansible, Pulumi).
  • Monitor, troubleshoot, and optimize performance across compute, storage, and networking layers.
  • Collaborate with ML engineers and researchers to ensure infrastructure aligns with model training and deployment needs.
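
To make the Kubernetes responsibility above concrete, here is a minimal sketch using the official kubernetes Python client to schedule a single-GPU pod. The pod name, container image, and namespace are illustrative placeholders, not details from this posting.

    from kubernetes import client, config

    def launch_gpu_pod():
        # Assumes a reachable cluster with the NVIDIA device plugin installed,
        # so "nvidia.com/gpu" is a schedulable resource.
        config.load_kube_config()
        pod = client.V1Pod(
            metadata=client.V1ObjectMeta(name="gpu-smoke-test"),  # placeholder name
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[
                    client.V1Container(
                        name="worker",
                        image="nvcr.io/nvidia/pytorch:24.01-py3",  # illustrative image tag
                        command=["nvidia-smi"],
                        resources=client.V1ResourceRequirements(
                            limits={"nvidia.com/gpu": "1"}  # request one GPU
                        ),
                    )
                ],
            ),
        )
        client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)

In production this kind of workload would typically be managed as a Deployment or batch job provisioned through IaC rather than an ad-hoc pod.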

Required Skills & Experience

  • Proven hands-on experience with GPU-enabled infrastructure management (NVIDIA GPUs, CUDA, NCCL, GPU drivers).
  • Strong background in designing distributed systems that handle large-scale data pipelines and high-performance workloads.
  • Experience with Kubernetes orchestration and workload scaling for ML/AI use cases.
  • Proficiency with Infrastructure as Code (IaC) tools (Terraform, Ansible, Pulumi, etc.).
  • Strong knowledge of cloud platforms (AWS, Google Cloud Platform, Azure) and on-premises HPC clusters.
  • Familiarity with storage architectures (object storage, distributed file systems like Lustre, Ceph, BeeGFS).
  • Understanding of networking performance optimization (RDMA, InfiniBand, NVLink).
  • Experience supporting ML frameworks (PyTorch, TensorFlow, DeepSpeed, Hugging Face Accelerate) in distributed training/inference environments (see the sketch after this list).
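
As a concrete example of the distributed-training experience described above, the sketch below shows the standard PyTorch process-group setup over NCCL plus a toy all-reduce to verify GPU-to-GPU communication. It assumes a torchrun-style launcher that sets the usual rank environment variables.

    import os
    import torch
    import torch.distributed as dist

    def init_distributed():
        # Rank/world-size env vars are set by the launcher (e.g. torchrun);
        # NCCL is the standard backend for multi-GPU NVIDIA clusters.
        dist.init_process_group(backend="nccl")
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)
        return local_rank

    if __name__ == "__main__":
        local_rank = init_distributed()
        # Toy all-reduce: every worker contributes 1, so the result
        # should equal the world size on every rank.
        t = torch.ones(1, device=f"cuda:{local_rank}")
        dist.all_reduce(t, op=dist.ReduceOp.SUM)
        print(f"rank {dist.get_rank()}: sum across workers = {t.item()}")
        dist.destroy_process_group()

Launched as, for example, torchrun --nproc_per_node=8 check_nccl.py on each node (the script name is a placeholder).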

Preferred Qualifications

  • Experience with MLOps practices (CI/CD for ML, model versioning, monitoring).
  • Exposure to AI accelerators beyond GPUs (Google TPUs, Intel Habana Gaudi, AMD Instinct MI series).
  • Knowledge of observability tools (Prometheus, Grafana, Datadog) for GPU/cluster monitoring (see the exporter sketch after this list).
  • Familiarity with data engineering for large-scale preprocessing pipelines.
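
For the observability item above, one common pattern is a small exporter that publishes NVML metrics for Prometheus to scrape. The sketch below assumes the prometheus_client and nvidia-ml-py (pynvml) packages; the metric name and port are arbitrary choices for illustration.

    import time
    from prometheus_client import Gauge, start_http_server
    import pynvml

    # Metric name is an illustrative choice, not a standard exporter name.
    GPU_UTIL = Gauge("gpu_utilization_percent", "Per-GPU utilization", ["gpu"])

    def main():
        pynvml.nvmlInit()
        start_http_server(9400)  # port chosen arbitrarily for this sketch
        handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
                   for i in range(pynvml.nvmlDeviceGetCount())]
        while True:
            for i, h in enumerate(handles):
                util = pynvml.nvmlDeviceGetUtilizationRates(h).gpu
                GPU_UTIL.labels(gpu=str(i)).set(util)
            time.sleep(5)

    if __name__ == "__main__":
        main()

In practice, Grafana dashboards or Datadog integrations would sit on top of metrics like these.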