Job Title: AI Infrastructure and Kubernetes Platform Architect – DGX Systems
Location: Remote
Duration: 6 months to 2 years
Must be an articulate communicator, able to explain complex technical concepts clearly.
Job Description:
We are seeking a highly skilled AI Infrastructure and Kubernetes Platform Architect with deep expertise in managing GPU-accelerated workloads on NVIDIA DGX systems. The ideal candidate will have hands-on experience with Kubernetes at the administrator, application developer, and security levels (CKA, CKAD, CKS), and will be responsible for designing, deploying, securing, and maintaining large-scale AI infrastructure powered by DGX BasePODs and SuperPODs. This role involves optimizing AI workloads, managing high-performance networking (InfiniBand), and ensuring operational excellence across NVIDIA AI systems and BlueField DPU environments.
Key Responsibilities:
Kubernetes and AI Platform Orchestration
- Architect and maintain containerized AI/ML platforms using Kubernetes on DGX systems.
- Integrate NVIDIA Base Command Manager with Kubernetes for workload scheduling and GPU resource optimization.
- Design multi-tenant GPU resource partitioning strategies using MIG (Multi-Instance GPU) to maximize hardware utilization across concurrent AI workloads.
- Implement and manage Helm charts, custom controllers, and the NVIDIA GPU Operator for scalable ML infrastructure.
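As a sketch of the MIG partitioning work described above: with the NVIDIA GPU Operator's "mixed" MIG strategy, sliced GPUs are advertised to Kubernetes as named extended resources that pods can request directly. The namespace, pod name, and container image below are hypothetical placeholders; the available MIG profile names depend on the GPU model and the cluster's configured MIG geometry.

```yaml
# Illustrative only -- assumes the NVIDIA GPU Operator is installed with
# MIG strategy "mixed", which exposes slices as nvidia.com/mig-<profile>.
apiVersion: v1
kind: Pod
metadata:
  name: mig-demo            # hypothetical name
  namespace: ml-tenant-a    # hypothetical tenant namespace
spec:
  restartPolicy: Never
  containers:
    - name: cuda-workload
      image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          # Request one 1g.5gb MIG slice (an A100 profile; profile names
          # vary by GPU model and configured MIG geometry).
          nvidia.com/mig-1g.5gb: 1
```

Requesting named MIG resources rather than whole GPUs (`nvidia.com/gpu`) is what allows several tenants' workloads to share a single physical device with hardware-level isolation.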
DGX Infrastructure Administration
- Administer and optimize NVIDIA DGX BasePODs and SuperPODs.
- Ensure optimal GPU, CPU, and storage performance across AI clusters.
- Leverage DGX System Administration best practices for lifecycle management and updates.
- Coordinate capacity planning for DGX cluster expansion including rack power, cooling, and storage integration with NVIDIA AI Enterprise software stack.
High-Performance Networking & DPU
- Deploy, monitor, and manage InfiniBand networks using Unified Fabric Manager (UFM).
- Integrate BlueField DPUs for offloaded security, networking, and storage tasks.
- Optimize end-to-end data pipelines from storage to GPUs.
Security and Compliance
- Apply best practices from the CKS certification to harden Kubernetes clusters and AI workloads.
- Implement secure service mesh and microsegmentation with BlueField DPU integration.
- Conduct regular audits, vulnerability scanning, and security policy enforcement.
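One CKS-style hardening baseline implied by the responsibilities above is network microsegmentation per tenant namespace. A minimal sketch, assuming a hypothetical tenant namespace `ml-tenant-a`: a default-deny NetworkPolicy that blocks all ingress and egress until explicit allow rules are layered on top.

```yaml
# Default-deny baseline for a tenant namespace; per-workload allow
# policies would be added separately. Namespace name is hypothetical.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: ml-tenant-a
spec:
  podSelector: {}        # matches every pod in the namespace
  policyTypes:
    - Ingress
    - Egress
```

Note that NetworkPolicy enforcement requires a CNI that supports it; in a BlueField DPU environment, enforcement can additionally be offloaded to the DPU.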
Automation & Monitoring
- Automate deployment pipelines and infrastructure provisioning with IaC tools (Terraform, Ansible).
- Monitor performance metrics using GPU telemetry, Prometheus/Grafana, and NVIDIA DCGM.
- Troubleshoot and resolve complex system issues across hardware and software layers.
- Implement MLOps workflows integrating Kubeflow Pipelines, NVIDIA Triton Inference Server, and model registry tooling to support end-to-end model training and production deployment.
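The GPU-telemetry monitoring mentioned above is commonly wired up by scraping the DCGM exporter with Prometheus and alerting on its metrics. A hedged sketch of one such alerting rule, assuming the standard `dcgm-exporter` deployment (the metric name `DCGM_FI_DEV_GPU_TEMP` and the `gpu`/`Hostname` labels come from that exporter; the 85 °C threshold and group name are illustrative choices, not NVIDIA-recommended values):

```yaml
# Prometheus alerting-rule fragment; thresholds are illustrative.
groups:
  - name: dgx-gpu-alerts          # hypothetical rule-group name
    rules:
      - alert: GpuTemperatureHigh
        # DCGM_FI_DEV_GPU_TEMP is exported per GPU by dcgm-exporter.
        expr: DCGM_FI_DEV_GPU_TEMP > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "GPU {{ $labels.gpu }} on {{ $labels.Hostname }} above 85C for 5m"
```

Similar rules can be written against other DCGM fields (e.g. utilization or XID error counters) to cover the hardware-and-software troubleshooting responsibilities listed here.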
Required Skills and Qualifications:
- CKA, CKAD, CKS certifications – demonstrating full-stack Kubernetes expertise.
- Proven experience with NVIDIA DGX systems and AI workload orchestration.
- Hands-on expertise in InfiniBand networking, UFM, and BlueField DPU administration.
- Strong scripting and automation skills in Python and Bash, plus fluency with YAML-based configuration.
- Familiarity with Base Command Manager, NVIDIA GPU Operator, and Kubeflow is a plus.
- Ability to work across teams to support ML researchers, DevOps engineers, and infrastructure teams.