AI Platform Architect – DGX & SuperPOD

Remote • Posted 3 hours ago • Updated 3 hours ago
Contract W2
Contract Independent
Contract Corp To Corp
No Travel Required
$110 - $120/hr

Job Details

Skills

  • NVIDIA Certification

Summary

Title: NVIDIA AI Infrastructure & Kubernetes Platform Engineer (DGX Systems)

Remote

NVIDIA certification required

We are seeking a highly skilled AI Infrastructure & Kubernetes Platform Engineer with a proven track record of deploying and managing NVIDIA DGX-based AI clusters, orchestrating containerized AI workloads on Kubernetes, and ensuring secure, high-throughput operations across InfiniBand networks. The ideal candidate holds a combination of Kubernetes certifications (CKA, CKAD, CKS) and NVIDIA certifications (NCA-AIIO, NCP-AIO, NCP-AII, NCP-AIN), along with hands-on training in DGX, BlueField, and high-speed network operations.

This position plays a key role in supporting AI/ML infrastructure at scale, enabling efficient training and inference for complex models, and integrating NVIDIA's cutting-edge compute, storage, and fabric solutions with modern DevOps practices.
  

Core Responsibilities:
 AI Infrastructure Operations

  • Deploy and manage NVIDIA DGX BasePODs and SuperPODs for high-performance AI workloads.
  • Oversee DGX system lifecycle operations including provisioning, monitoring, firmware upgrades, and capacity planning.
  • Operate Base Command Manager to manage GPU clusters, schedule workloads, and integrate with MLOps tools.
  • Perform DGX node health validation, NCCL interconnect testing, and NVLink topology verification following new deployments or hardware changes.

 Kubernetes Platform Engineering

  • Architect secure and scalable Kubernetes clusters optimized for GPU-accelerated workloads using NVIDIA GPU Operator.
  • Leverage expertise from CKA/CKAD/CKS to develop, deploy, and secure AI applications on Kubernetes.
  • Implement CI/CD pipelines and GitOps methodologies for deploying and managing ML workflows.

 High-Performance Networking & DPUs

  • Administer InfiniBand networks and BlueField DPUs using Unified Fabric Manager (UFM).
  • Enable NVLink/NVSwitch performance across GPU nodes and tune fabric configurations for minimal latency and maximum throughput.
  • Use BlueField DPUs to offload storage, firewalling, and telemetry, improving AI workload security and performance.

 Security & Compliance

  • Apply best practices from the CKS certification to secure containerized AI environments.
  • Configure runtime security, secrets management, network segmentation, and auditing using DPU-enhanced Kubernetes deployments.
  • Support zero-trust architecture initiatives by enforcing workload identity, RBAC policies, and supply chain integrity across AI container images and model artifacts.

 Monitoring, Telemetry & Optimization

Employers have access to artificial intelligence language tools (“AI”) that help generate and enhance job descriptions and AI may have been used to create this description. The position description has been reviewed for accuracy and Dice believes it to correctly reflect the job opportunity.
  • Dice Id: 10513292
  • Position Id: 72261-12895-