Principal Supercomputing Networking/Software Engineer

Overview

On Site

$200k - $300k per year + equity

Full Time

Skills

Orchestration

Management

Kubernetes

Virtualization

Provisioning

Switches

Scalability

Routing Protocols

Border Gateway Protocol

Software Engineering

Linux

Debugging

Performance Tuning

GPU

Computer Networking

Design Software

Computer Hardware

Scripting

Documentation

Rust

HPC

InfiniBand

Ethernet

Storage

Weka

Ceph

Job Details

About the Role
As a Principal Systems Engineer, you'll be responsible for designing and operating the infrastructure that powers our global GPU clusters. Your scope will include system software, orchestration, and distributed automation, with networking as an integral layer rather than the primary specialization. The role emphasizes software engineering first, with the ability to design systems that dynamically integrate networking and compute at scale.

Key Areas of Impact
Designing and operating orchestration frameworks to manage tens of thousands of GPUs across Kubernetes, virtualization, and bare metal.
Developing automation frameworks for large-scale provisioning, monitoring, and fault tolerance.
Building distributed systems that can withstand node-level or cluster-wide failures.
Architecting software-defined networking solutions that integrate with underlay switches and support scalable overlay designs.
Collaborating with networking specialists to ensure fabric resilience, low latency, and scalability, leveraging routing protocols like BGP where needed.
Integrating high-performance distributed storage with compute and networking layers.

About You
Strong software engineering background, with experience building fault-tolerant distributed systems.
Comfortable with Linux internals, debugging, and performance optimization.
Exposure to GPU/HPC clusters.
Networking literacy: familiar with eBGP, VXLAN, RoCEv2, and InfiniBand, plus an understanding of how to design software systems that dynamically leverage these fabrics (rather than focusing on deep hardware detail).
Strong automation, scripting, and documentation skills.

Nice to Haves
Go or Rust (3+ years).
Deeper knowledge of HPC fabrics (InfiniBand, Ultra Ethernet).
Experience with high-performance storage (WEKA, VAST, Ceph).
Prior exposure to global distributed compute operations.

About LHi Group Ltd
