Job Details
Job Title: HPC Infrastructure/Network Engineer
Location: Ashburn, VA - Onsite
Duration: 6 to 8 months
Role Overview:
This HPC (High-Performance Computing) role focuses on planning, implementing, and managing InfiniBand network configurations for high-performance computing in data centers. It emphasizes network and physical network troubleshooting (e.g., NIC testing, Ixia-enabled testing), with a skill distribution of 60% network, 30% Linux + CI/CD, and 10% HPC. Responsibilities include configuring switches, routers, and adapters; implementing security protocols; monitoring performance; troubleshooting; collaborating with vendors; and developing automation scripts.
Key Responsibilities:
- Configure and manage InfiniBand networks, including switches, routers, adapters, and performance tuning (e.g., MTU, buffer sizes, PFC/DCB for congestion management).
- Conduct physical network troubleshooting (e.g., NIC testing, Ixia-enabled testing for performance validation).
- Develop automation scripts (Python, Shell) for network tasks using libraries such as Netmiko, NAPALM, and Jinja; Ansible is a plus.
- Monitor performance using tools like EPM/IPM; implement security protocols (MACsec, IPsec, access controls).
- Collaborate with vendors for compatibility, POCs, and BOMs; support lab/pre-field testing.
- Document configurations and processes via MOP/SOP.
Qualifications:
- Bachelor's degree in Computer Science, IT, or related field.
- 5+ years of InfiniBand experience in enterprise/lab environments.
- Expertise in InfiniBand architecture and protocols; RoCE a plus.
- Proficient in Python, Shell scripting (junior developer level, 1-2 years) for network automation; Git experience preferred.
- Strong network security (MACsec/IPsec), troubleshooting, and performance tuning skills.
- Familiarity with RDMA applications, parallel computing frameworks (e.g., MPI, OpenMP).
- Certifications (e.g., IBTA, CCNP) a plus; Linux/UNIX proficiency and CI/CD mindset required.
Skill Distribution (60/30/10):
- 60% Network: Emphasis on InfiniBand troubleshooting, NIC testing, Ixia-enabled testing, and performance tuning (e.g., PFC/DCB, MTU).
- 30% Linux + CI/CD: Linux/UNIX administration, Python/Shell scripting for automation, CI/CD familiarity (Git/Jenkins).
- 10% HPC: Basic HPC cluster knowledge, RDMA applications, parallel computing (MPI/OpenMP).