Title: NVIDIA AI Infrastructure & Kubernetes Platform Engineer (DGX Systems)
Remote
NVIDIA Certification required
We are seeking a highly skilled AI Infrastructure & Kubernetes Platform Engineer with a proven track record of deploying and managing NVIDIA DGX-based AI clusters, orchestrating containerized AI workloads on Kubernetes, and ensuring secure, high-throughput operations across InfiniBand networks. The ideal candidate holds a combination of Kubernetes certifications (CKA, CKAD, CKS) and NVIDIA certifications (NCA-AIIO, NCP-AIO, NCP-AII, NCP-AIN), along with hands-on training in DGX, BlueField, and high-speed network operations.
This position plays a key role in supporting AI/ML infrastructure at scale, enabling efficient training and inference for complex models, and integrating NVIDIA's cutting-edge compute, storage, and fabric solutions with modern DevOps practices.
Core Responsibilities:
AI Infrastructure Operations
- Deploy and manage NVIDIA DGX BasePODs and SuperPODs for high-performance AI workloads.
- Oversee DGX system lifecycle operations including provisioning, monitoring, firmware upgrades, and capacity planning.
- Operate Base Command Manager to manage GPU clusters, schedule workloads, and integrate with MLOps tools.
- Perform DGX node health validation, NCCL interconnect testing, and NVLink topology verification following new deployments or hardware changes.
Kubernetes Platform Engineering
- Architect secure and scalable Kubernetes clusters optimized for GPU-accelerated workloads using NVIDIA GPU Operator.
- Apply CKA/CKAD/CKS-level expertise to develop, deploy, and secure AI applications on Kubernetes.
- Implement CI/CD pipelines and GitOps methodologies for deploying and managing ML workflows.
High-Performance Networking & DPUs
- Administer InfiniBand networks and BlueField DPUs using Unified Fabric Manager (UFM).
- Optimize NVLink/NVSwitch performance across GPU nodes and tune fabric configurations for minimal latency and maximum throughput.
- Use BlueField DPUs to offload storage, firewalling, and telemetry, improving AI workload security and performance.
Security & Compliance
- Apply best practices from the CKS certification to secure containerized AI environments.
- Configure runtime security, secrets management, network segmentation, and auditing using DPU-enhanced Kubernetes deployments.
- Support zero-trust architecture initiatives by enforcing workload identity, RBAC policies, and supply chain integrity across AI container images and model artifacts.
Monitoring, Telemetry & Optimization