Senior HPC Systems Engineer

Overview

Remote

Full Time

25% Travel

Skills

HPC

HPC batch schedulers

OpenMP

MPI

System Administration

PBSPro

Slurm

Linux

Job Details

RedLine Performance Solutions (RedLine) has provided HPC solutions engineering services for 25 years and consistently holds new hires to a high "bar of excellence." This enables RedLine to accomplish what other firms cannot and promotes a high level of staff retention. We offer services ranging from full life cycle HPC systems engineering to remote managed services to HPC program analysis.

We are seeking a Senior HPC Systems Engineer to join our NASA NACS High Performance Computing (HPC) team in Mountain View, CA. This role primarily provides supercomputing systems administration support for the NASA NACS HPC contract.

U.S. citizenship and the ability to obtain a Public Trust security clearance are mandatory requirements for this position. The position can be remote, but the work is performed during Pacific time zone business hours. Travel to the customer site is required 2-3 times a year.

An individual at this skill level should have demonstrated extensive experience working with common HPC batch schedulers (e.g., PBS, Slurm, or Moab/Torque) while supporting users of HPC resources with the issues they encounter in getting applications to execute efficiently. This individual should also demonstrate experience installing, maintaining, and upgrading HPC systems. The individual, along with the entire HPC team, will be engaged in the day-to-day operations and support of the HPC resources. Activities may include system patching, operating system upgrades, deploying new systems, writing scripts, and troubleshooting issues on the HPC systems. The ability to interact with users to determine symptoms, and then reproduce their issues to isolate the root cause of failure, is a critical skill for this position. Activities will also include testing, benchmarking, user tool scripting, and analyzing trouble tickets to find patterns that indicate system problems or gaps in user education.

Duties and Responsibilities:

Oversee and directly contribute to significant ongoing HPC integrations into the environment
Design and develop enhancements to the PBSPro batch scheduler based on customer-driven requirements
Apply best practices in systems engineering, delivering projects on time, on budget, and with excellent quality
Provide support to staff and end users to resolve HPC system issues
Mentor junior staff and cross-train peers
Provide after-hours and weekend support as required
Moderate and contribute to supercomputing system administration, including:
- Day-to-day operations of the Linux HPC clusters and storage systems
- Proactive monitoring, analysis, and correction of system issues
- Development of scripts to automate repetitive tasks or tools to enhance support of the HPC systems
- System performance analysis and tuning
- Building, installing, and supporting user-requested software
- Supporting evaluation and assessment of new HPC technology
- Resolving user-reported issues and managing support ticket requests in Remedy

Requirements:

Bachelor of Science degree in Computer Science or a related field
Strong computer science background with in-depth systems-level knowledge in operating systems and networking
Solid understanding of the software development process, including requirements, use cases, design, coding, documentation and testing of scalable, distributed applications in a Linux environment
A minimum of 10 years of experience with HPC systems administration
A minimum of 10 years of experience developing system software in heterogeneous, multi-platform HPC environments
Demonstrated equivalent of 10 years of Linux/UNIX user support experience and hands-on experience with administration of Linux systems
Experience working with HPC applications and familiarity with at least one of C, C++, or Fortran
Superior scripting skills and excellent attention to detail; proficiency in at least one of Python, Perl, or Bash
Strong ability to interact with customers to understand needs, elicit requirements, and obtain feedback on prototype solutions
Excellent communication and people skills; excellent time management and organizational skills
Experience with system configuration management tools (e.g., Puppet, Ansible)
Experience with revision control software (e.g., Git)
Proficiency in technical writing

Preferred Skills:

Experience with Lustre and InfiniBand
Familiarity/proficiency with OpenMP and Message Passing Interface (MPI) programming
Experience with cloud technologies (AWS, Azure, Google Cloud Platform), OpenStack or Kubernetes is a plus
