Overview
Hybrid
Depends on Experience
Accepts corp to corp applications
Contract - Independent
Contract - W2
Contract - 12 Month(s)
Skills
HPC
MIG
GPU
DevOps
Prometheus
Grafana
Job Details
About Radiant Digital, Inc
Radiant Digital, Inc. delivers Program Management, Science, Engineering, and Technology Solutions to Federal, Commercial, State, and Local Agencies. We are a subsidiary of Radiant Digital Services. We have a vast portfolio of clients across the country. Our technology support for many DoD agencies, NASA, Voice of America, FDA, and state agencies such as the State of FL, RI, MS, ND, VA, and WV extends our delivery of solutions worldwide.
Position: Kubernetes Engineer - GPU Platform Engineering
Location: Dallas, TX
Working Model: Hybrid
Duration: 12 Months with possible extension
Job Summary:
We are seeking a highly skilled Kubernetes Engineer to join our Platform Engineering function in Dallas.
In this role, you will design, implement, and optimise GPU-accelerated container platforms at scale, enabling high-performance workloads (AI/ML, HPC, LLM training) across hybrid or on-prem environments.
You should have deep expertise in both the NVIDIA and Kubernetes ecosystems, including GPU scheduling, device plugins and custom operators.
Key responsibilities of the role include:
- Architecting and operating Kubernetes clusters optimised for GPU workloads, leveraging NVIDIA GPU Operator, Network Operator and DCGM
- Developing, deploying and maintaining custom Kubernetes operators and controllers to automate infrastructure services
- Integrating NVIDIA device plugins, Multi-Instance GPU (MIG) and GPU sharing features into the scheduling layer
- Optimising GPU utilisation and job placement through scheduler extensions and batch schedulers, such as kube-scheduler plugins, Volcano and Slurm
- Collaborating with HPC, ML and DevOps teams to ensure multi-tenant, high-throughput cluster performance
- Driving observability and telemetry integrations using Prometheus, Grafana, DCGM Exporter and OpenTelemetry
- Implementing secure multi-user and multi-namespace GPU isolation, with RBAC and policy enforcement, such as OPA or Gatekeeper
- Maintaining CI/CD pipelines for Kubernetes infrastructure using GitOps tools such as ArgoCD and FluxCD
- Contributing to infrastructure-as-code, using Terraform, Helm, and Kustomize
- Participating in performance tuning, incident response and production readiness reviews