Senior HPC Infrastructure Engineer

Overview

Remote

USD 94,640.00 - 169,520.00 per year

Full Time

Skills

Servers

Scalability

Artificial Intelligence

Customization

Database

Network Operations

Collaboration

Data Retention

Business Continuity Planning

Risk Management

Service Delivery

Disaster Recovery

Statistics

Estimating

Capacity Management

Presentations

IT Infrastructure

Computer Science

Red Hat Enterprise Linux

High Performance Computing

Management

LSF

Kubernetes

xCAT

IBM

Spectrum

IBM GPFS

MPI

InfiniBand

TCP/IP

Computer Networking

Aruba

Ethernet

Switches

Storage

Linux

Shell Scripting

Training

HPC

Research

Internet

Job Details

Join a cutting-edge team dedicated to pushing the boundaries of high-performance computing (HPC) and artificial intelligence (AI) infrastructure! As a Senior HPC Infrastructure Engineer, you'll play a pivotal role in designing, implementing, and optimizing our state-of-the-art HPC clusters and servers. Your expertise will ensure that our research computing environment excels in scalability, redundancy, and performance.

Key Responsibilities:

Lead the architecture, design, and implementation of advanced HPC/AI systems to support groundbreaking research.
Oversee the ongoing monitoring, support, and maintenance of our HPC/AI clusters, ensuring peak performance and reliability.
Drive system upgrades, customization, and seamless integration with database administrators, software developers, network operations, and data center teams.
Manage and maintain a diverse range of computer systems and application software, ensuring they meet the highest standards of functionality and efficiency.
Ensure continuous support and monitoring of our research computing infrastructure, delivering exceptional service 24/7.

What We Offer:

An opportunity to work with cutting-edge technology in a dynamic, collaborative environment.
A role that directly impacts the success of groundbreaking research projects.
A chance to collaborate with top-tier professionals across various disciplines.

If you're passionate about HPC technology and thrive in a fast-paced, innovative setting, we want to hear from you!

This position may be eligible for remote work.

Job Responsibilities:

Oversee configuration and management of the IT infrastructure to support requirements (e.g. data retention, security, business continuity, disaster recovery, information risk management).
Monitor and evaluate the efficiency and effectiveness of infrastructure service delivery methods and procedures.
Lead and manage internal infrastructure in accordance with established regulations and standards.
Implement and monitor incident/problem management and disaster recovery for infrastructure support.
Manage and provide current system usage statistics, and provide projected growth estimates based on customer demand.
Partner with internal teams to develop prioritization, metrics, and processes around capacity planning and infrastructure availability.
Periodically present capacity planning and performance reports to senior leaders during presentations and meetings.
Benchmark, analyze, and make recommendations for improvement of IT infrastructure.
Perform other duties as assigned to meet the goals and objectives of the department and institution.
Maintain regular and predictable attendance.

Minimum Education and/or Training:

Bachelor's degree in Computer Science, Engineering, Business or related field of study required.
Master's degree preferred.

Minimum Experience:

Four (4) years of IT experience, including experience in infrastructure operations and engineering environments.
Experience with Red Hat Enterprise Linux (RHEL) is highly preferred.
Experience with using and supporting Linux in a high-performance computing (HPC) cluster and research computing environment is highly preferred.
Must have experience managing an HPC cluster.
Experience with Slurm and/or LSF is highly preferred.
Experience with Kubernetes (e.g., Rancher, OpenShift, etc.) is a plus.
Experience with Base Command Manager, Bright Cluster Manager, or another HPC cluster manager (e.g., HPCM, xCAT, Warewulf, Scyld) is highly preferred.
Experience with IBM Spectrum Scale (GPFS) is required; experience with Lustre is a plus.
Experience with Message Passing Interface (MPI) is highly preferred.
Experience with InfiniBand, Ethernet, and TCP/IP networking and topology is highly preferred.
Experience with HPE Aruba Ethernet switches is preferred.
Experience with NVIDIA GPUs is required; experience with AMD GPUs is a plus.
Experience with NVIDIA GPUDirect Storage is a plus.
Advanced knowledge and strong, in-depth understanding of HPC technologies and principles.
Must have strong knowledge of Linux security and Linux shell scripting.
Proven performance in an earlier or comparable role.

Compensation
In recognition of certain U.S. state and municipal pay transparency laws, St. Jude is including a reasonable estimate of the compensation range for this role. This is an estimate offered in good faith and a specific salary offer takes into account factors that are considered in making compensation decisions including but not limited to skill sets, experience and training, licensure and certifications, and other business and organizational needs. It is not typical for an individual to be hired at or near the top of the salary range and compensation decisions are dependent on the facts and circumstances of each case. A reasonable estimate of the current salary range is $94,640 - $169,520 per year for the role of Senior HPC Infrastructure Engineer.

Explore our exceptional benefits!

St. Jude is an Equal Opportunity Employer

No Search Firms

St. Jude Children's Research Hospital does not accept unsolicited assistance from search firms for employment opportunities. Please do not call or email. All resumes submitted by search firms to any employee or other representative at St. Jude via email, the internet or in any form and/or method without a valid written search agreement in place and approved by HR will result in no fee being paid in the event the candidate is hired by St. Jude.

Employers have access to artificial intelligence language tools (“AI”) that help generate and enhance job descriptions and AI may have been used to create this description. The position description has been reviewed for accuracy and Dice believes it to correctly reflect the job opportunity.
