Overview
On Site
$100000 - $130000
Full Time
Skills
High Performance Computing
Visualization
IaaS
Systems Engineering
Linux Administration
Management
File Systems
Configuration Management
Procurement
Lifecycle Management
Computer Networking
Authentication
HIPAA
Artificial Intelligence
Machine Learning (ML)
Computational Science
Research
Computer Hardware
Collaboration
Linux
Red Hat Linux
SAN
IBM GPFS
InfiniBand
Bash
Python
Ansible
Terraform
Cloud Computing
Amazon Web Services
Google Cloud
Google Cloud Platform
System Administration
HPC
Servers
Operating Systems
Performance Tuning
Storage
Network
Scripting
Technical Support
Identity Management
Help Desk
Software Deployment
Backup Administration
Documentation
Regulatory Compliance
Inventory
Reporting
Insurance
SAP BASIS
Job Details
A research computing organization that serves as the primary provider of high-performance computing (HPC), storage, and visualization resources for a large academic research community is looking to bring on a Senior HPC Systems Engineer. The team supports thousands of users across hundreds of research groups, enabling advanced research through centrally managed HPC infrastructure, scientific software, and technical expertise.
Join a small, highly collaborative systems and operations team responsible for building, operating, and scaling large-scale HPC environments. This role plays a critical part in maintaining and evolving complex on-prem and hybrid cloud infrastructure that supports diverse research workloads across multiple scientific disciplines. This is a hybrid position requiring 3 days onsite to support hands-on data center and infrastructure operations.
Required Skills & Experience
- 5-7+ years of experience in systems administration, HPC systems engineering, or related roles
- Strong Linux systems administration experience (Red Hat-based environments)
- Experience installing, configuring, and maintaining large compute clusters and servers
- Experience with HPC schedulers such as Slurm
- Experience managing high-performance or parallel file systems (e.g., GPFS or similar)
- Scripting experience with Bash and Python
- Experience using automation and configuration management tools (e.g., Ansible)
- Experience troubleshooting hardware, OS, storage, and networking issues in production environments
- Familiarity with hybrid infrastructure environments (on-prem with AWS and/or Google Cloud Platform)
- Bachelor's degree in a related technical field or equivalent practical experience
Desired Skills & Experience
- Experience in academic, research, or national lab HPC environments
- Experience with HPC hardware procurement and lifecycle management
- Familiarity with InfiniBand networking and HPC authentication mechanisms
- Experience with infrastructure-as-code tools (e.g., Terraform)
- Experience supporting HIPAA-compliant or regulated systems
- Exposure to AI/ML, scientific computing, or data-intensive research workloads
- Experience supporting heterogeneous hardware environments
- Strong documentation and cross-team collaboration skills
What You Will Be Doing
- 40% Hands-on systems administration of HPC clusters, servers, and operating systems
- 20% Monitoring, performance tuning, and troubleshooting of compute, storage, and network components
- 15% Automation, scripting, patching, and security maintenance
- 10% User support, access management, and help desk ticket resolution
- 10% Software deployment, upgrades, backups, and restores
- 5% Documentation, compliance tracking, and inventory reporting
Tech Breakdown
- 60% Linux Systems & HPC Infrastructure (Red Hat, cluster administration)
- 20% Storage, Networking & Performance (GPFS, InfiniBand, monitoring)
- 15% Automation & Scripting (Bash, Python, Ansible, Terraform)
- 5% Cloud & Hybrid Integration (AWS, Google Cloud Platform)
The Offer
You will receive the following benefits:
- Medical, Dental, and Vision Insurance
- Vacation Time
Applicants must be currently authorized to work in the US on a full-time basis now and in the future.