InfiniBand Switch Engineer

Overview

Remote
Depends on Experience
Contract - Independent
Contract - W2

Skills

InfiniBand
Switch
NVIDIA

Job Details

One of our clients is urgently looking for an experienced InfiniBand Switch Engineer to design, deploy, and maintain high-performance, low-latency networking infrastructure to support HPC and AI/ML workloads. The ideal candidate will have a deep understanding of InfiniBand networking technologies, switch management, fabric design, and integration with compute and storage clusters.

Duration: Long-term

Hourly rate: Open (depends on experience)

Key Responsibilities

  • Design, implement, and manage InfiniBand networks for high-performance computing or AI clusters.
  • Install, configure, and maintain InfiniBand and related high-performance switches (e.g., NVIDIA/Mellanox Quantum for InfiniBand, Spectrum for Ethernet, or similar).
  • Monitor and troubleshoot InfiniBand fabric performance, latency, and connectivity issues using tools such as ibstat, ibdiagnet, perfquery, and NVIDIA Fabric Manager.
  • Manage fabric topology and optimize routing, partitioning, and QoS to ensure efficient utilization.
  • Integrate InfiniBand fabric with storage systems (e.g., Lustre, BeeGFS, GPFS) and compute nodes.
  • Collaborate with system administrators and application teams to ensure network reliability and scalability.
  • Develop automation scripts for configuration, monitoring, and reporting (Python, Ansible, Bash, etc.).
  • Perform firmware upgrades and maintain switch OS/software in compliance with security and performance best practices.
  • Document configurations, topology maps, and operational procedures.
  • Participate in capacity planning and future network expansion strategies.

Required:

  • Bachelor's degree in Computer Science, Electrical Engineering, or a related field (or equivalent experience).
  • 3+ years of experience managing or engineering InfiniBand or similar HPC network infrastructures.
  • Proficiency with InfiniBand concepts such as subnet managers (SM), LIDs, PKeys, and fabric partitioning.
  • Hands-on experience with Mellanox/NVIDIA networking hardware and tools (e.g., UFM, Fabric Manager).
  • Strong knowledge of Linux networking, scripting, and system administration.
  • Experience with cluster management and job scheduling systems (e.g., SLURM, PBS, or Kubernetes).

Preferred:

  • Experience with RDMA over Converged Ethernet (RoCE), NVLink, or GPUDirect RDMA technologies.
  • Familiarity with high-performance storage (e.g., Lustre, BeeGFS, NFS over RDMA).
  • Knowledge of InfiniBand performance tuning and benchmarking tools (e.g., ib_write_bw, ib_read_lat).
  • Certifications from NVIDIA/Mellanox or similar vendors.

About New Millennium Consulting