Cloud Engineer || NVIDIA GPU || Onsite Multiple Hubs

Overview

On Site
Depends on Experience
Contract - W2
Contract - Independent
Contract - 24 Month(s)
50% Travel
Able to Provide Sponsorship

Skills

Amazon Web Services
Ansible
Collaboration
Artificial Intelligence
Computer Hardware
Bash
CUDA
Cloud Computing
Communication
DevOps
Docker
Documentation
GPU
Google Cloud
Google Cloud Platform
Grafana
HPC
Job Scheduling
Knowledge Transfer
Kubernetes
Lifecycle Management
Linux Administration
Machine Learning (ML)
Management
Microsoft Azure
Orchestration
PyTorch
Python
Regulatory Compliance
Scripting
Server Administration
Servers
TensorFlow
Terraform
Training
Workflow

Job Details

We are seeking a Cloud Engineer with strong NVIDIA GPU server administration experience to support high-performance AI/ML infrastructure.

Responsibilities:

Administer and maintain GPU-accelerated servers and clusters (NVIDIA A100, H100, etc.).

Manage and optimize the NVIDIA software stack (CUDA, cuDNN, TensorRT, NCCL, NGC containers).

Monitor system performance and troubleshoot hardware/software issues.

Collaborate with DevOps and AI teams on containerized workflows (Docker, Kubernetes) and distributed training environments.

Ensure infrastructure security and compliance with internal/external standards.

Lead upgrades, patching, and lifecycle management of GPU servers.

Create documentation and automation scripts, and provide knowledge transfer to internal teams.


Required Skills & Experience

Bachelor’s degree and 8+ years of overall experience

5+ years of server administration experience

3+ years working with NVIDIA GPU systems

Strong Linux administration experience (HPC/AI environments preferred)

Hands-on experience with NVIDIA GPU drivers, the CUDA Toolkit, and performance tuning

Familiarity with Slurm, Kubernetes, and job scheduling/orchestration tools

Experience with monitoring/automation tools (Prometheus, Grafana, Ansible, Terraform)

Strong scripting skills (Bash, Python)

Excellent troubleshooting and communication skills


Preferred / Nice to Have

NVIDIA Certified Professional

Experience with multi-node and multi-GPU training environments

Familiarity with AI/ML frameworks (PyTorch, TensorFlow)

Exposure to cloud GPU environments (AWS, Azure, Google Cloud Platform)
