Cloud Engineer || NVIDIA GPU || Onsite Multiple Hubs

Overview

On Site
Depends on Experience
Contract - W2
Contract - Independent
Contract - 24 Month(s)
50% Travel
Able to Provide Sponsorship

Skills

Amazon Web Services
Ansible
Collaboration
Artificial Intelligence
Computer Hardware
Bash
CUDA
Cloud Computing
Google Cloud
Communication
DevOps
TensorFlow
Regulatory Compliance
Lifecycle Management
HPC
Docker
Documentation
GPU
Kubernetes
Google Cloud Platform
Machine Learning (ML)
Grafana
Job Scheduling
Microsoft Azure
Knowledge Transfer
Linux Administration
Server Administration
Management
Orchestration
PyTorch
Python
Scripting
Servers
Terraform
Training
Workflow

Job Details

We are seeking a Cloud Engineer with strong NVIDIA GPU server administration experience to support high-performance AI/ML infrastructure.

Responsibilities:

Administer and maintain GPU-accelerated servers and clusters (NVIDIA A100, H100, etc.).

Manage and optimize the NVIDIA software stack (CUDA, cuDNN, TensorRT, NCCL, NGC containers).

Monitor system performance and troubleshoot hardware/software issues.

Collaborate with DevOps and AI teams on containerized workflows (Docker, Kubernetes) and distributed training environments.

Ensure infrastructure security and compliance with internal/external standards.

Lead upgrades, patching, and lifecycle management of GPU servers.

Create documentation and automation scripts, and provide knowledge transfer to internal teams.


Required Skills & Experience

Bachelor’s degree and 8+ years of overall experience

5+ years of server administration experience

3+ years working with NVIDIA GPU systems

Strong Linux administration experience (HPC/AI environments preferred)

Hands-on experience with NVIDIA GPU drivers, the CUDA toolkit, and performance tuning

Familiarity with Slurm, Kubernetes, and job scheduling/orchestration tools

Experience with monitoring/automation tools (Prometheus, Grafana, Ansible, Terraform)

Strong scripting skills (Bash, Python)

Excellent troubleshooting and communication skills


Preferred / Nice to Have

NVIDIA Certified Professional

Experience with multi-node and multi-GPU training environments

Familiarity with AI/ML frameworks (PyTorch, TensorFlow)

Exposure to cloud GPU environments (AWS, Azure, Google Cloud Platform)
