Company Description
Odyn Network is at the forefront of artificial intelligence innovation, building transformative AI solutions that demand cutting-edge, high-performance infrastructure. Our mission is to accelerate AI development through scalable, efficient, and reliable systems. Join us to shape the future of AI infrastructure and power groundbreaking machine learning workloads.
Role Description
As a Senior Engineer, you will be the technical owner of how GPU resources are scheduled, shared, and scaled across Generative AI workloads. Your expertise will directly drive faster experiments, higher model throughput, and significant cost savings per training run. If you’re passionate about transforming heterogeneous GPU fleets into a unified, high-efficiency “supercomputer,” this role is your opportunity to make a massive impact.
Key Responsibilities
Orchestrate GPU Clusters: Design, implement, load-balance, and manage multi-tenant GPU clusters (on-premises, cloud, or hybrid) using Kubernetes, Slurm, or similar platforms, ensuring high utilization, fairness, and reliability.
Optimize Resource Placement and Sharing: Develop topology-aware schedulers and plugins (NUMA, PCIe, NVLink, InfiniBand, RoCE) that leverage MIG/MPS, preemption, quotas, and bin-packing strategies to maximize effective GPU utilization.
Automate Capacity and Autoscaling: Build workload-aware autoscaling systems for training and inference workloads (using tools like Ray, Run:AI, Volcano, or Kubeflow), integrating spot/preemptible strategies with checkpointing and graceful eviction.
Enhance Observability and SLOs: Implement deep telemetry for GPUs, network fabric, and jobs using Prometheus, Grafana, or OpenTelemetry. Define and monitor SLOs (e.g., queue time, runtime variance, failure rate) and create actionable dashboards and alerts; a minimal exporter sketch follows this list.
Maximize Throughput and Cost Efficiency: Profile and tune NCCL/CUDA/ROCm, GPUDirect RDMA, RoCEv2, and InfiniBand fabric parameters to minimize idle time and fragmentation. Model and report cost per GPU-hour and per training step to drive efficiency; a back-of-the-envelope example follows this list.
Optimize Storage and Data Paths: Collaborate on high-throughput I/O systems (Lustre, BeeGFS, Ceph, S3, NVMe-oF, Alluxio caching) and dataset prefetching/checkpoint pipelines to ensure GPUs remain fully utilized.
Build Platform Glue: Develop Kubernetes operators, controllers, and admission webhooks; enforce multi-tenancy through RBAC, network policies, and quotas; and integrate with CI/CD pipelines (GitHub Actions, Argo CD) and secrets management (Vault).
Partner with ML Teams: Translate AI model requirements (e.g., DLRM, LLM pretraining/finetuning, diffusion, retrieval) into optimized cluster policies, instance configurations, and job templates that deliver predictable performance.
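As a flavor of the observability work above, here is a minimal sketch of a per-GPU utilization exporter. It assumes NVIDIA's NVML Python bindings (pynvml) and the prometheus_client library; the metric names, port, and polling interval are illustrative choices, not a prescribed design.

```python
# Minimal GPU telemetry sketch: polls per-GPU utilization and memory via
# NVML and exposes them as Prometheus gauges.
# Assumes the pynvml and prometheus_client packages are installed;
# metric names, labels, port, and interval are illustrative.
import time

import pynvml
from prometheus_client import Gauge, start_http_server

GPU_UTIL = Gauge("gpu_sm_utilization_percent", "SM utilization per GPU", ["gpu"])
GPU_MEM_USED = Gauge("gpu_memory_used_bytes", "Device memory in use per GPU", ["gpu"])

def poll(interval_s: float = 15.0) -> None:
    pynvml.nvmlInit()
    try:
        handles = [
            pynvml.nvmlDeviceGetHandleByIndex(i)
            for i in range(pynvml.nvmlDeviceGetCount())
        ]
        while True:
            for i, handle in enumerate(handles):
                util = pynvml.nvmlDeviceGetUtilizationRates(handle)
                mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
                GPU_UTIL.labels(gpu=str(i)).set(util.gpu)
                GPU_MEM_USED.labels(gpu=str(i)).set(mem.used)
            time.sleep(interval_s)
    finally:
        pynvml.nvmlShutdown()

if __name__ == "__main__":
    start_http_server(9400)  # Prometheus scrape endpoint; port is arbitrary
    poll()
```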
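Similarly, the cost-modeling bullet above boils down to simple arithmetic; the sketch below shows its shape, with all rates and throughput figures as hypothetical placeholders rather than real pricing.

```python
# Back-of-the-envelope cost model: effective cost per GPU-hour and cost
# per training step. All inputs are hypothetical placeholders.
from dataclasses import dataclass

@dataclass
class RunProfile:
    gpus: int                   # GPUs allocated to the job
    hourly_rate_per_gpu: float  # $/GPU-hour (cloud list price or amortized on-prem)
    steps_per_hour: float       # measured training throughput
    utilization: float = 1.0    # fraction of allocated GPU-hours doing useful work

    def cost_per_gpu_hour(self) -> float:
        # Idle or fragmented time inflates the effective rate.
        return self.hourly_rate_per_gpu / self.utilization

    def cost_per_step(self) -> float:
        hourly_burn = self.gpus * self.hourly_rate_per_gpu
        return hourly_burn / self.steps_per_hour

run = RunProfile(gpus=64, hourly_rate_per_gpu=2.50, steps_per_hour=1800, utilization=0.85)
print(f"effective $/GPU-hour: {run.cost_per_gpu_hour():.2f}")
print(f"$/training step:      {run.cost_per_step():.4f}")
```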
Required Qualifications
2+ years of experience building and operating distributed infrastructure, with a focus on compute/accelerator fleets and production GPU clusters.
Hands-on experience with Kubernetes (device plugins, operators, CRDs) and/or Slurm (partitions, QoS, fair-share).
Experience with GPU workload orchestration frameworks such as Ray, Run:AI, or Volcano.
Systems knowledge, including Linux internals (cgroups v2, eBPF basics), networking (ECN, pacing), and container runtimes (Docker, CRI-O, containerd).
Proficiency with CUDA/NCCL or ROCm/RCCL, MIG/MPS, topology-aware scheduling, and high-speed interconnects (InfiniBand HDR/NDR, RoCEv2, 100GbE and above).
Proficiency in Python (or Go/C++) for automation and tooling, plus infrastructure-as-code (Terraform, Helm, Ansible).
Experience with AI workload storage systems (Lustre, BeeGFS, Ceph, S3) and checkpointing strategies.
Data-driven mindset, with a track record of building utilization/queue-time dashboards, running A/B tests on schedulers, and delivering measurable performance gains.
Communication: Ability to collaborate with cross-functional teams, translating complex ML workload needs into robust infrastructure solutions.
Preferred Qualifications
Experience managing H100/A100, L40S, or MI300 GPU fleets and planning NVLink/NVSwitch configurations.
Expertise in inference serving at scale (e.g., Triton, KServe), tokenizer offloading, or KV-cache sharding.
Familiarity with cost modeling and FinOps for hybrid GPU fleets, including purchase vs. lease vs. cloud strategies.
Knowledge of multi-tenant cluster security (Pod Security, SELinux/AppArmor, image signing, network policies).
Understanding of queueing theory, bin-packing heuristics, or simulation tools (e.g., SimPy) for policy design; a toy placement sketch follows this list.
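For the bin-packing item above, a toy first-fit-decreasing placement sketch; the node capacities and job shapes are invented for illustration, and a production scheduler would also weigh topology, fairness, and preemption as described earlier.

```python
# Toy first-fit-decreasing (FFD) bin-packing sketch for GPU placement:
# sort jobs by GPU demand, then place each on the first node with room.
# Node capacities and job demands below are invented for illustration.

def ffd_place(jobs: dict[str, int], nodes: dict[str, int]) -> dict[str, str]:
    """Map job name -> node name; raises if a job cannot be placed."""
    free = dict(nodes)  # remaining GPUs per node
    placement: dict[str, str] = {}
    for job, demand in sorted(jobs.items(), key=lambda kv: -kv[1]):
        for node, capacity in free.items():
            if capacity >= demand:
                free[node] = capacity - demand
                placement[job] = node
                break
        else:
            raise RuntimeError(f"no node can fit {job} ({demand} GPUs)")
    return placement

if __name__ == "__main__":
    nodes = {"node-a": 8, "node-b": 8, "node-c": 4}
    jobs = {"pretrain": 8, "finetune": 4, "eval": 2, "notebook": 1}
    for job, node in ffd_place(jobs, nodes).items():
        print(f"{job:9s} -> {node}")
```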
What We Offer
Competitive compensation packages.
Flexible work arrangements, including remote options.
Opportunities to work with cutting-edge AI technologies and collaborate with world-class AI researchers and engineers.