Senior HPC Systems Administrator

Overview

Remote
On Site
Full Time
10% Travel

Skills

CentOS
HPC
Job Scheduling
MPI
InfiniBand
Microsoft Azure
IBM GPFS
Red Hat Linux
Linux
High Performance Computing

Job Details

RedLine Performance Solutions (RedLine) has provided HPC solutions engineering services for over 25 years and holds new hires to a consistently high "bar of excellence." This enables RedLine to accomplish what other firms cannot and promotes a high level of staff retention. We offer services ranging from full life cycle HPC systems engineering to remote managed services to HPC program analysis.
 
RedLine is looking for a Senior High Performance Computing (HPC) Systems Administrator to join our team. The administrator will provide support and administration for a large on-premises HPC cluster and a small cloud-based HPC cluster. The ideal candidate is an experienced individual with a strong background in security, Linux, HPC, configuration management, systems automation, and networking.
 
  • This position involves mission-critical monitoring and maintenance and requires off-hours support as part of a team rotation.
  • U.S. citizenship and the ability to obtain a Public Trust clearance are required to apply.
  • Candidates in the Phoenix, AZ area are preferred; however, the position can be remote with the possibility of some travel.
  • This full-time position includes a comprehensive benefits package featuring paid time off, a 401(k) match, health insurance, and a full range of additional benefits.
Job Responsibilities:
  • Provide HPC cluster administration using technologies such as HPCM, Lustre, Slingshot, Cray OS, and Slurm
  • Engage with the customer to identify needs and user stories that drive enhancements and upgrades for the HPC clusters
  • Work with configuration management solutions to develop Ansible playbooks for image generation and server support
  • Work with version control systems to submit and review Git pull requests from the team to ensure that cluster support follows best practices
  • Update and expand existing systems monitoring capabilities
  • Develop automation tools for cluster administration
  • Participate in resource optimization and in the tuning of job scheduling software and policies
  • Support HPE-based Cluster Management solutions
  • Provide technical support to researchers using HPC resources, troubleshoot problems, and develop appropriate computational strategies.
Job Requirements:
  • Minimum of 7 years of SLES, Red Hat, and CentOS Linux system administration experience in an HPC environment.
  • Experience with schedulers/batch systems (e.g., Slurm, PBS, LSF)
  • Experience with managing parallel and cluster file systems (e.g., GPFS, Lustre)
  • Network management experience, including in an HPC context (e.g., InfiniBand, Omni-Path)
  • Demonstrated ability to configure, deploy, and manage a major system area such as batch system, network, data storage, backup system, database system, or distributed computing
  • Scripting experience (e.g., bash, Python, Perl).
Preferred Skills:
  • Experience supporting HPC cloud environments (e.g., Azure)
  • Server provisioning and image management
  • Experience with Lmod/Lua
  • Experience with MPI technologies
  • One of the ISC2 certifications (e.g., CISSP, SSCP) or Security+ certification
  • Experience integrating applications with a cloud provider's software stack.