Senior HPC Systems Administrator

Overview

Remote

On Site

Full Time

10% Travel

Skills

CentOS

HPC

Job Scheduling

MPI

InfiniBand

Microsoft Azure

IBM GPFS

Red Hat Linux

Linux

High Performance Computing

Job Details

RedLine Performance Solutions (RedLine) has provided HPC solutions engineering services for over 25 years and consistently keeps the "bar of excellence" high for new hires. This enables RedLine to accomplish what other firms cannot and promotes a high level of staff retention. We offer services ranging from full life cycle HPC systems engineering to remote managed services to HPC program analysis.

RedLine is looking for a Senior High Performance Computing (HPC) Systems Administrator to join our team. The administrator will provide support and administration for a large on-premises HPC cluster and a small cloud-based HPC cluster. The Senior HPC Systems Administrator will be an experienced individual with a strong background in security, Linux, HPC, configuration management, systems automation, and networking.

Job Details

This position involves mission-critical monitoring and maintenance and requires off-hours support as part of a team rotation.
U.S. citizenship and the ability to obtain a Public Trust clearance are required to apply.
The preference is for the candidate to be in the Phoenix, AZ area; however, the position can be remote with the possibility of some travel.
This full-time position includes a comprehensive benefits package featuring paid time off, a 401(k) match, health insurance, and a full range of additional benefits.

Job Responsibilities:

Provide HPC cluster administration using technologies such as HPCM, Lustre, Slingshot, Cray OS, and Slurm
Engage with the customer to identify the needs and user stories to build enhancements and upgrades for the HPC clusters
Work with configuration management solutions to develop Ansible playbooks to support image generation and server support
Work with version control systems to create and review Git pull requests from the team to ensure that cluster support follows best practices
Update and expand existing systems monitoring capabilities
Develop automation tools for cluster administration
Participate in resource optimization, including work on job scheduling software and policies
Support HPE-based cluster management solutions
Provide technical support to researchers using HPC resources, troubleshoot problems, and develop appropriate computational strategies

Job Requirements:

Minimum of 7 years of Linux system administration experience (SLES, Red Hat, and CentOS) in an HPC environment
Experience with schedulers/batch systems (e.g., Slurm, PBS, LSF)
Experience with managing parallel and cluster file systems (e.g., GPFS, Lustre)
Network management experience, including in an HPC context (e.g., InfiniBand, Omni-Path)
Demonstrated ability to configure, deploy, and manage a major system area such as a batch system, network, data storage, backup system, database system, or distributed computing
Scripting experience (e.g., bash, Python, Perl)

Preferred Skills:

Experience supporting HPC cloud environments (e.g., Azure)
Server provisioning and image management
Experience with Lmod/Lua
Experience with MPI technologies
One of the ISC² certifications (e.g., CISSP, SSCP) or Security+ certification
Experience integrating applications with a cloud provider's software stack
