HPC Architect

Overview

Location: Remote
Compensation: Depends on Experience
Employment Type: Contract - W2

Skills

NVIDIA DGX
GPU Clusters

Job Details

Job Title: HPC Infrastructure Architect
Location: 100% Remote (EST Hours)
Duration: 6 Months
Job Description:
  • We're seeking an experienced Infrastructure Architect to design, implement, and optimize NVIDIA DGX environments with a specialized focus on Run:ai orchestration. This role requires deep expertise in GPU-accelerated infrastructure and AI workload management to maximize resource efficiency and scalability.
Key Responsibilities
  • Architect DGX Solutions: Design and deploy NVIDIA DGX infrastructure, primarily solutions centered on the DGX B300 platform; strong experience with previous generations such as the DGX H100 and H200 is highly relevant and valued. A key aspect of this role is integrating these DGX solutions with Run:ai for dynamic GPU orchestration.
  • Run:ai Implementation: Configure and manage Run:ai's AI-native scheduling, resource pooling, and policy engine to optimize GPU utilization across hybrid environments (on-premises, cloud, edge)
  • Lifecycle Management: Oversee end-to-end AI workflows, from data preparation and model training to deployment, using Run:ai's unified platform
  • Access Control: Implement and maintain role-based access control (RBAC) using Run:ai's predefined roles (e.g., System Admin, Department Admin) and scope-based permissions
  • Performance Optimization: Monitor and tune cluster performance using Run:ai's observability tools, ensuring maximal GPU throughput and minimal idle time
  • Cross-functional Collaboration: Partner with data science and IT teams to align infrastructure capabilities with AI project requirements
Required Qualifications
Technical Expertise:
  • 10+ years of experience in Linux-based advanced compute environments
  • Proficiency in NVIDIA DGX systems and Kubernetes-based orchestration.
  • Hands-on experience with Run:ai's dynamic scheduling, policy engine, and KAI Scheduler
  • Familiarity with hybrid/multi-cloud GPU resource management (AWS, Google Cloud Platform, Azure).
Operational Skills:
  • Ability to configure RBAC scopes (departments, projects) and workload prioritization in Run:ai
  • Experience optimizing distributed AI training and inference workloads.
  • Proactive Outreach: Initiate and maintain contact with NVIDIA technical teams on an ongoing basis
  • Clear Communication: Maintain clear and consistent communication channels for discussions of bugs, technical updates, and other issues.
  • Certifications: NVIDIA DGX System or Run:ai certification preferred.
Preferred Experience
  • Deploying Run:ai in large-scale AI factories with 100+ GPUs.
  • Managing NVIDIA AI Enterprise software stacks.
  • Integrating Run:ai with MLOps pipelines for automated resource provisioning
  • Familiarity with the NVIDIA Mission Control AI factory management platform (includes NVIDIA Base Command Manager, Run:ai, and features such as Autonomous Job Recovery, On-Demand Health Checks, and customizable dashboards)
  • Familiarity with SLURM for bare-metal or containerized access to the compute infrastructure.
  • Experience with NVIDIA Spectrum-X networking is a plus

About Cloud Destinations LLC