Overview
Skills
Job Details
Title: Senior HPC Infrastructure Engineer
Location : Skokie, IL , several days a week.
Longterm Contract with Possible Conversion
Job Description
Summary
The Senior HPC Infrastructure Engineer will support the design, implementation, optimization, and ongoing management of our High-Performance Computing (HPC) infrastructure. The role blends technical proficiency in system architecture design, Linux- based HPC clusters, high-speed interconnects, and HPC storage solutions, alongside day- to-day system administration. You will collaborate with cross-functional teams, including HPC Operations Engineers, researchers, and IT staff to ensure reliable, scalable, and secure HPC environments supporting complex scientific computations and data analysis.
Essential Functions, Roles, and Responsibilities
Design, deploy, and maintain scalable HPC systems across on-premises and cloud environments by collaborating with stakeholders and Cloud Engineers, defining system requirements, and optimizing hardware, Linux OS, software, and networking for performance, reliability, and scalability.
Administer and monitor HPC environments to ensure high availability, applying patches, security updates, and resolving issues promptly to minimize downtime and maintain optimal performance.
Design and manage high-performance storage solutions.
Implement robust backup, replication, archival, and disaster recovery strategies to ensure data integrity. Design and manage high-performance storage systems, implementing backup, replication, archival, and disaster recovery strategies to ensure data integrity and availability.
Conduct regular benchmarking and performance tuning of HPC applications, identify and resolve system bottlenecks, and collaborate with HPC Operations Engineers on hardware and configuration optimizations to maximize efficiency.
Collaborate with the Information Security team to implement and monitor security measures, ensure compliance with data protection standards, and stay current on cybersecurity best practices for HPC environments.
Maintain detailed documentation of system configurations, procedures, standards, and troubleshooting guides to support effective operations and knowledge sharing. Provide technical support for HPC infrastructure, conduct training sessions to promote effective use, and manage on-site end-user computing technologies to ensure a seamless user experience.
Professional Competencies
- Proficiency in Linux/Unix system administration.
- Familiarity with parallel computing frameworks (e.g., MPI, OpenMP).
In-dept understanding of networking concepts, storage technologies, and system performance tuning.
Hands-on experience with job scheduling and resource management systems (e.g., Slurm, Torque, PBS).
In-dept knowledge of high-speed interconnects (InfiniBand, Omni-Path) and GPU acceleration is a plus.
- Strong troubleshooting and diagnostic skills.
- Excellent verbal and written communication skills and a collaborative working style.
Education & Experience
Bachelor's degree in computer science, computer engineering, or equivalent combination of education and experience. Master's degree preferred.
- Experience supporting HPC environments in research or academic settings. Experience with scripting languages such as Bash and Python. Relevant technical certifications (e.g., Red Hat, CompTIA Linux+, or similar)