HPC Architect

Overview

Location: Remote
Compensation: Depends on Experience
Employment Type: Contract - W2

Skills

NVIDIA DGX
GPU Clusters

Job Details

Job Title: HPC Infrastructure Architect
Location: 100% Remote (EST Hours)
Duration: 6 Months
Job Description:
  • We're seeking an experienced Infrastructure Architect to design, implement, and optimize NVIDIA DGX environments with a specialized focus on Run:ai orchestration. This role requires deep expertise in GPU-accelerated infrastructure and AI workload management to maximize resource efficiency and scalability.
Key Responsibilities
  • Architect DGX Solutions: Design and deploy NVIDIA DGX infrastructure, primarily solutions centered on the DGX B300 platform; strong experience with previous generations such as the DGX H100 and H200 is highly relevant and valued. A key aspect of this role is integrating these DGX solutions with Run:ai for dynamic GPU orchestration.
  • Run:ai Implementation: Configure and manage Run:ai's AI-native scheduling, resource pooling, and policy engine to optimize GPU utilization across hybrid environments (on-premises, cloud, edge)
  • Lifecycle Management: Oversee end-to-end AI workflows, from data preparation and model training to deployment, using Run:ai's unified platform
  • Access Control: Implement and maintain role-based access control (RBAC) using Run:ai's predefined roles (e.g., System Admin, Department Admin) and scope-based permissions
  • Performance Optimization: Monitor and tune cluster performance using Run:ai's observability tools, ensuring maximal GPU throughput and minimal idle time
  • Cross-functional Collaboration: Partner with data science and IT teams to align infrastructure capabilities with AI project requirements
Required Qualifications
Technical Expertise:
  • 10+ years of experience in Linux-based advanced compute environments
  • Proficiency in NVIDIA DGX systems and Kubernetes-based orchestration.
  • Hands-on experience with Run:ai's dynamic scheduling, policy engine, and KAI Scheduler
  • Familiarity with hybrid/multi-cloud GPU resource management (AWS, Google Cloud Platform, Azure).
Operational Skills:
  • Ability to configure RBAC scopes (departments, projects) and workload prioritization in Run:ai
  • Experience optimizing distributed AI training and inference workloads.
  • Proactive Outreach: Initiate and maintain contact with NVIDIA technical teams on an ongoing basis
  • Clear Communication: Maintain clear and consistent communication channels for discussions of bugs, technical updates, and other issues.
  • Certifications: NVIDIA DGX System or Run:ai certification preferred.
Preferred Experience
  • Deploying Run:ai in large-scale AI factories with 100+ GPUs.
  • Managing NVIDIA AI Enterprise software stacks.
  • Integrating Run:ai with MLOps pipelines for automated resource provisioning
  • Familiarity with the NVIDIA Mission Control AI factory management platform (includes NVIDIA Base Command Manager, Run:ai, and features such as Autonomous Job Recovery, On-Demand Health Checks, and customizable dashboards)
  • Familiarity with SLURM for bare-metal or containerized access to the compute infrastructure.
  • Experience with NVIDIA Spectrum-X networking is a plus

About Cloud Destinations LLC