Principal Supercomputing Networking/Software Engineer

  • San Francisco, CA
  • Posted 2 days ago | Updated 5 hours ago

Overview

On Site
$200k - $300k per year + equity
Full Time

Skills

Orchestration
Management
Kubernetes
Virtualization
Provisioning
Switches
Scalability
Routing Protocols
Border Gateway Protocol
Software Engineering
Linux
Debugging
Performance Tuning
GPU
Computer Networking
Design Software
Computer Hardware
Scripting
Documentation
Rust
HPC
InfiniBand
Ethernet
Storage
WEKA
Ceph

Job Details

About the Role
As a Principal Systems Engineer, you'll be responsible for designing and operating the infrastructure that powers our global GPU clusters. Your scope will include system software, orchestration, and distributed automation, with networking as an integral layer rather than the primary specialization. The role emphasizes software engineering first, with the ability to design systems that dynamically integrate networking and compute at scale.

Key Areas of Impact
  • Designing and operating orchestration frameworks to manage tens of thousands of GPUs across Kubernetes, virtualization, and bare metal.
  • Developing automation frameworks for large-scale provisioning, monitoring, and fault tolerance.
  • Building distributed systems that can withstand node or cluster-wide failures.
  • Architecting software-defined networking solutions that integrate with underlay switches and support scalable overlay designs.
  • Collaborating with networking specialists to ensure fabric resilience, low latency, and scalability, leveraging routing protocols like BGP where needed.
  • Integrating high-performance distributed storage with compute and networking layers.

About You
  • Strong software engineering background, with experience building fault-tolerant distributed systems.
  • Comfortable with Linux internals, debugging, and performance optimization.
  • Exposure to GPU/HPC clusters.
  • Networking literacy: familiar with eBGP, VXLAN, RoCEv2, and InfiniBand, plus an understanding of how to design software systems that dynamically leverage these fabrics (rather than focusing on deep hardware detail).
  • Strong automation, scripting, and documentation skills.

Nice to Haves
  • Go or Rust (3+ years).
  • Deeper knowledge of HPC fabrics (InfiniBand, Ultra Ethernet).
  • Experience with high-performance storage (WEKA, VAST, Ceph).
  • Prior exposure to global distributed compute operations.

About LHi Group Ltd