Austin, Texas
•
Today
Job Description: Our team is the GPU Availability and Monitoring team in the Compute Org. We are responsible for designing and developing architectural changes for GPU delivery, health monitoring, triage automation, and diagnostic services. These are essential for running distributed AI/ML/HPC workloads across thousands of GPUs, leveraging technologies like RoCE and InfiniBand. We are looking for a highly skilled and motivated distributed systems engineer who can architect solutions to scale an
Full-time
USD 96,800.00 per year