Senior Kubernetes Engineer

Hybrid in Irving, TX, US • Posted 3 hours ago • Updated 3 hours ago
Full Time
No Travel Required
Hybrid
200,000 - 225,000/yr

Job Details

Skills

  • Kubernetes
  • NVIDIA GPU Operator
  • DCGM
  • MIG
  • NVML
  • custom Kubernetes operator
  • CRD
  • Go/Golang
  • kube-scheduler
  • Volcano
  • ArgoCD
  • Helm
  • Prometheus
  • RBAC
  • HPC

Summary

In this role, you will design, implement, and optimise GPU-accelerated container platforms at scale, enabling high-performance workloads (AI/ML, HPC, LLM training) across hybrid or on-prem environments.
You will have deep expertise in both the NVIDIA and Kubernetes ecosystems, including GPU scheduling, device plugins and custom operators.

Responsibilities

  • Architecting and operating Kubernetes clusters optimised for GPU workloads, leveraging NVIDIA GPU Operator, Network Operator and DCGM

  • Developing, deploying and maintaining custom Kubernetes operators and controllers to automate infrastructure services

  • Integrating NVIDIA device plugins, Multi-Instance GPU (MIG) and GPU sharing features into the scheduling layer

  • Optimising GPU utilisation and job placement through scheduler extensions and batch schedulers, such as kube-scheduler plugins, Slurm and Volcano

  • Collaborating with HPC, ML and DevOps teams to ensure multi-tenant, high-throughput cluster performance

  • Driving observability and telemetry integrations using Prometheus, Grafana, DCGM Exporter and OpenTelemetry

  • Implementing secure multi-user and multi-namespace GPU isolation, with RBAC and policy enforcement, such as OPA or Gatekeeper

  • Maintaining CI/CD pipelines for Kubernetes infrastructure using GitOps tools such as ArgoCD and FluxCD

  • Contributing to infrastructure-as-code, using Terraform, Helm, and Kustomize

  • Participating in performance tuning, incident response and production readiness reviews

Requirements

  • Extensive experience with Kubernetes in production-grade environments and with the NVIDIA GPU stack on Kubernetes, including GPU Operator, device plugin, NVML, MIG and DCGM

  • Proficiency in Go or Python for operator development and Kubernetes controller logic

  • Deep understanding of Kubernetes internals, including CRDs, RBAC, custom controllers and scheduler extensions

  • Experience with GPU-intensive workloads, such as LLMs, training pipelines and scientific computing

  • Hands-on experience with Helm, Kustomize and GitOps workflows

  • Familiarity with CNI plugins, especially NVIDIA CNI and Multus

  • Experience with monitoring GPU metrics and cluster health using Prometheus and DCGM Exporter

Employers have access to artificial intelligence language tools (“AI”) that help generate and enhance job descriptions and AI may have been used to create this description. The position description has been reviewed for accuracy and Dice believes it to correctly reflect the job opportunity.
  • Dice Id: 90709083
  • Position Id: 8958697