Job Details
RedLine Performance Solutions (RedLine) has been in the HPC solutions engineering services business for 25 years and consistently holds new hires to a high "bar of excellence." This enables RedLine to accomplish what other firms cannot and promotes a high level of staff retention. We offer services ranging from full life cycle HPC systems engineering to remote managed services to HPC program analysis.
We are seeking a Senior HPC Systems Engineer to join our NASA NACS High Performance Computing (HPC) team in Mountain View, CA. This role primarily provides Supercomputing Systems Administration support for our NASA NACS HPC contract.
U.S. citizenship and the ability to obtain a Public Trust security clearance are mandatory requirements for this position. This position can be remote but requires working Pacific time zone business hours. Travel to the customer site will be required 2-3 times a year.
An individual at this skill level should have demonstrated extensive experience working with common HPC batch schedulers (e.g., PBS, Slurm, or Moab/Torque) while supporting users of HPC resources with the various issues they might have getting applications to execute efficiently. This individual should demonstrate experience installing, maintaining, and upgrading HPC systems.
The individual, along with the entire HPC team, will be engaged in the day-to-day operations and support of the HPC resources. Activities may include system patching, operating system upgrades, deploying new systems, writing scripts, and troubleshooting system issues on the HPC system. The ability to interact with users to determine symptoms, and then reproduce their issues to isolate the root cause of failure, is a critical skill for this position. There will also be activities in testing, benchmarking, user tool scripting, and analyzing trouble tickets to find patterns indicating system or user education issues.
Duties and Responsibilities:
- Oversee and directly contribute to significant ongoing HPC integrations into the environment
- Design and develop enhancements to the PBSPro batch scheduler based on customer-driven requirements
- Apply best practices in system engineering, delivering projects on time, on budget, and with excellent quality
- Provide support to staff and end users to resolve HPC system issues
- Mentor junior staff and cross-train peers
- Provide after-hours and weekend support as required
- Moderate and contribute to Supercomputing Systems Administration activities, including:
  - Day-to-day operations of the Linux HPC clusters and storage systems
  - Proactively monitoring, analyzing, and correcting system issues
  - Developing scripts to automate repetitive tasks or tools to enhance support of the HPC systems
  - System performance analysis and tuning
  - Building, installing, and supporting user-requested software
  - Supporting evaluation and assessment of new HPC technology
  - Resolving user-reported issues and managing support ticket requests in Remedy
Qualifications:
- Bachelor of Science degree in Computer Science or a related field
- Strong computer science background with in-depth systems-level knowledge in operating systems and networking
- Solid understanding of the software development process, including requirements, use cases, design, coding, documentation and testing of scalable, distributed applications in a Linux environment
- A minimum of 10 years of experience with HPC systems administration
- A minimum of 10 years of experience developing system software in heterogeneous, multi-platform HPC environments
- Demonstrated equivalent of 10 years of Linux/UNIX user support experience and hands-on experience administering Linux systems
- Experience working with HPC applications and familiarity with at least one of C, C++, or Fortran
- Superior scripting skills and excellent attention to detail; proficiency in at least one of Python, Perl, or Bash
- Strong ability to interact with customers to understand needs, elicit requirements, and obtain feedback on prototype solutions
- Excellent communication and people skills; excellent time management and organizational skills
- Experience with system configuration management tools (e.g., Puppet, Ansible)
- Experience with revision control software (e.g., Git)
- Proficiency in technical writing
- Experience with Lustre and InfiniBand
- Familiarity or proficiency with OpenMP and Message Passing Interface (MPI) programming
- Experience with cloud technologies (AWS, Azure, Google Cloud Platform), OpenStack, or Kubernetes is a plus