Cloud Engineer || NVIDIA GPU || Onsite Multiple Hubs

Overview

On Site

Depends on Experience

Contract - W2

Contract - Independent

Contract - 24 Month(s)

50% Travel

Able to Provide Sponsorship

Skills

Amazon Web Services

Ansible

Collaboration

Artificial Intelligence

Computer Hardware

Bash

CUDA

Cloud Computing

Communication

DevOps

Docker

Documentation

GPU

Google Cloud

Google Cloud Platform

Grafana

HPC

Job Scheduling

Knowledge Transfer

Kubernetes

Lifecycle Management

Linux Administration

Machine Learning (ML)

Management

Microsoft Azure

Orchestration

PyTorch

Python

Regulatory Compliance

Scripting

Server Administration

Servers

TensorFlow

Terraform

Training

Workflow

Job Details

We are seeking a Cloud Engineer with strong NVIDIA GPU server administration experience to support high-performance AI/ML infrastructure.

Responsibilities

Administer and maintain GPU-accelerated servers and clusters (NVIDIA A100, H100, etc.).

Manage and optimize the NVIDIA software stack (CUDA, cuDNN, TensorRT, NCCL, NGC containers).

Monitor system performance and troubleshoot hardware/software issues.

Collaborate with DevOps and AI teams on containerized workflows (Docker, Kubernetes) and distributed training environments.

Ensure infrastructure security and compliance with internal/external standards.

Lead upgrades, patching, and lifecycle management of GPU servers.

Create documentation and automation scripts, and provide knowledge transfer to internal teams.

Required Skills & Experience

Bachelor’s degree plus 8+ years of overall experience

5+ years of server administration experience

3+ years working with NVIDIA GPU systems

Strong Linux administration experience (HPC/AI environments preferred)

Hands-on experience with NVIDIA GPU drivers, the CUDA toolkit, and performance tuning

Familiarity with Slurm, Kubernetes, and job scheduling/orchestration tools

Experience with monitoring/automation tools (Prometheus, Grafana, Ansible, Terraform)

Strong scripting skills (Bash, Python)

Excellent troubleshooting and communication skills

Preferred / Nice to Have

NVIDIA Certified Professional

Experience with multi-node and multi-GPU training environments

Familiarity with AI/ML frameworks (PyTorch, TensorFlow)

Exposure to cloud GPU environments (AWS, Azure, Google Cloud Platform)

