System Administrator - High-Performance Computing (HPC) - NMK Global Inc.

Overview

Location: Remote

Compensation: Depends on Experience

Employment Type: Full Time

Skills

System Administrator

HPC

Job Details

Hope you are doing well!

System Administrator - High-Performance Computing (HPC)

Location: Remote (US Eastern Time business hours)

Position Summary

We are seeking a highly skilled and motivated System Administrator to support a cutting-edge High-Performance Computing (HPC) environment that enables advanced scientific research across multiple universities. This role is critical in ensuring the performance, reliability, and usability of an NVIDIA GPU-based HPC infrastructure.

The ideal candidate will bring hands-on experience with NVIDIA GPU systems, Kubernetes (K8s), Slurm, and NVIDIA Base Command Manager, along with a strong ability to document processes and train users. You'll work at the forefront of computational science, directly enabling breakthroughs in fields such as genomics, physics, climate modeling, healthcare, and defense.

Key Responsibilities

System Support & Troubleshooting

Provide operational support and problem resolution for the NVIDIA NVL72 GPU system.
Monitor system health and performance, proactively identifying and resolving issues to maintain high uptime and availability.

Cluster & Workload Management

Administer and optimize the Slurm workload manager for efficient job scheduling and resource allocation.
Manage container orchestration using Kubernetes (K8s) within the HPC environment.

Software & Patch Management

Maintain and update NVIDIA software stacks, ensuring proper patch management, version control, and security compliance.
Use NVIDIA Base Command Manager for system orchestration, monitoring, and optimization.

Documentation & Knowledge Transfer

Author and maintain detailed technical documentation, including system architecture, configurations, and operational procedures.
Create clear, user-friendly "How To" guides to support onboarding and self-service among researchers and staff.
Conduct on-the-job training sessions for new team members and end users to facilitate knowledge transfer and best practices.

Qualifications

Required

Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent professional experience.
3 to 5 years of experience in system administration, preferably in HPC or GPU-accelerated environments.
Proficiency in Linux, Slurm, Kubernetes, and NVIDIA GPU technologies.
Demonstrated experience writing technical documentation and user support materials.
Strong communication and collaboration skills, particularly in academic or research-focused teams.
Prior experience with NVIDIA SuperPOD systems is mandatory.

Preferred

Familiarity with scientific computing workflows and research data management.
Experience supporting university or academic research environments.
Working knowledge of VAST and DDN storage, as well as networking, storage, and security best practices for HPC systems.

Team & Collaboration

You will collaborate closely with:

Another System Administrator and a Data Center Architect.
Additional administrators and technical experts supporting specific infrastructure operations as needed.

Key Performance Indicators (KPIs)

System uptime and reliability of the HPC environment.
User satisfaction among university researchers.
Effective knowledge transfer and documentation quality for new staff members.

Work Allocation

75% hands-on technical work (system administration, optimization, and support)
25% documentation writing, training, and user enablement

