Hi,
I hope you are doing well!
System Administrator, High-Performance Computing (HPC)
Location: Remote (US Eastern Time business hours)
Position Summary
We are seeking a highly skilled and motivated System Administrator to support a cutting-edge High-Performance Computing (HPC) environment that enables advanced scientific research across multiple universities. This role is critical in ensuring the performance, reliability, and usability of an NVIDIA GPU-based HPC infrastructure.
The ideal candidate will bring hands-on experience with NVIDIA GPU systems, Kubernetes (K8s), Slurm, and NVIDIA Base Command Manager, along with a strong ability to document processes and train users. You'll work at the forefront of computational science, directly enabling breakthroughs in fields such as genomics, physics, climate modeling, healthcare, and defense.
Key Responsibilities
System Support & Troubleshooting
- Provide operational support and problem resolution for the NVIDIA NVL72 GPU system.
- Monitor system health and performance, proactively identifying and resolving issues to maintain high uptime and availability.
Cluster & Workload Management
- Administer and optimize the Slurm workload manager for efficient job scheduling and resource allocation.
- Manage container orchestration using Kubernetes (K8s) within the HPC environment.
Software & Patch Management
- Maintain and update NVIDIA software stacks, ensuring proper patch management, version control, and security compliance.
- Use NVIDIA Base Command Manager for system orchestration, monitoring, and optimization.
Documentation & Knowledge Transfer
- Author and maintain detailed technical documentation, including system architecture, configurations, and operational procedures.
- Create clear, user-friendly "How To" guides to support onboarding and self-service among researchers and staff.
- Conduct on-the-job training sessions for new team members and end users to facilitate knowledge transfer and best practices.
Qualifications
Required
- Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent professional experience.
- 3 to 5 years of experience in system administration, preferably in HPC or GPU-accelerated environments.
- Proficiency in Linux, Slurm, Kubernetes, and NVIDIA GPU technologies.
- Demonstrated experience writing technical documentation and user support materials.
- Strong communication and collaboration skills, particularly in academic or research-focused teams.
- Prior experience with NVIDIA SuperPOD systems is mandatory.
Preferred
- Familiarity with scientific computing workflows and research data management.
- Experience supporting university or academic research environments.
- Working knowledge of VAST and DDN storage, along with networking, storage, and security best practices in HPC systems.
Team & Collaboration
You will collaborate closely with:
- Another System Administrator and a Data Center Architect.
- Additional administrators and technical experts supporting specific infrastructure operations as needed.
Key Performance Indicators (KPIs)
- System uptime and reliability of the HPC environment.
- User satisfaction among university researchers.
- Effective knowledge transfer and documentation quality for new staff members.
Work Allocation
- 75% hands-on technical work (system administration, optimization, and support)
- 25% documentation writing, training, and user enablement