Job Details
We are seeking a Cloud Engineer with strong NVIDIA GPU server administration experience to support high-performance AI/ML infrastructure.
Responsibilities:
Administer and maintain GPU-accelerated servers and clusters (NVIDIA A100, H100, etc.).
Manage and optimize the NVIDIA software stack (CUDA, cuDNN, TensorRT, NCCL, NGC containers).
Monitor system performance and troubleshoot hardware/software issues.
Collaborate with DevOps and AI teams for containerized workflows (Docker, Kubernetes) and distributed training environments.
Ensure infrastructure security and compliance with internal/external standards.
Lead upgrades, patching, and lifecycle management of GPU servers.
Create documentation, automation scripts, and provide knowledge transfer to internal teams.
Required Skills & Experience
Bachelor’s degree and 8+ years of overall experience
5+ years of server administration experience
3+ years working with NVIDIA GPU systems
Strong Linux administration experience (HPC/AI environments preferred)
Hands-on experience with NVIDIA GPU drivers, the CUDA Toolkit, and performance tuning
Familiarity with Slurm, Kubernetes, and job scheduling/orchestration tools
Experience with monitoring/automation tools (Prometheus, Grafana, Ansible, Terraform)
Strong scripting skills (Bash, Python)
Excellent troubleshooting and communication skills
Preferred / Nice to Have
NVIDIA Certified Professional
Experience with multi-node and multi-GPU training environments
Familiarity with AI/ML frameworks (PyTorch, TensorFlow)
Exposure to cloud GPU environments (AWS, Azure, Google Cloud Platform)