Senior Virtualization Validation Engineer

San Francisco, CA, US • Posted 1 hour ago • Updated 1 hour ago
Full Time
On-site
USD $172,500.00 - $210,000.00 per year

Job Details

Skills

  • Problem Solving
  • Conflict Resolution
  • Energy
  • Manufacturing
  • HPC
  • Testing
  • FOCUS
  • Benchmarking
  • Test Suites
  • Network
  • SR-IOV
  • ProVision
  • Stress Testing
  • Communication
  • ROOT
  • Computer Science
  • Electrical Engineering
  • Virtualization
  • QEMU
  • Kernel-based Virtual Machine
  • Cloud Computing
  • Hypervisor
  • Research
  • GPU
  • CUDA
  • Stacks Blockchain
  • Computer Networking
  • Remote Direct Memory Access
  • InfiniBand
  • Linux Kernel
  • PCI Express
  • Management
  • Scripting
  • Python
  • Bash
  • Test Scenarios
  • Artificial Intelligence
  • Computer Hardware
  • Debugging
  • Orchestration
  • Kubernetes
  • Life Insurance
  • Professional Development
  • Insurance
  • Market Analysis
  • Law

Summary

Crusoe is on a mission to accelerate the abundance of energy and intelligence. As the only vertically integrated AI infrastructure company built from the ground up, we own and operate each layer of the stack - from electrons to tokens - to power the world's most ambitious AI workloads. When you join Crusoe, you join a team that is building the future, faster.

We're in the midst of the greatest industrial revolution of our time. The demand for AI compute is boundless, and power is a bottleneck. We're solving that - with an energy-first approach that makes AI infrastructure better for the world and faster for the people innovating with AI.

We're looking for problem-solving, opportunity-finding teammates with a sense of urgency, who believe in the scale of our ambition and thrive on a path not fully paved - people who want to grow their careers alongside a team of experts across energy, manufacturing, data center construction, and cloud services.

If you want to do the most meaningful work of your career, help our customers and partners advance their AI strategies, and be part of a high-performing team that believes in each other, come build with us at Crusoe.

About the Role:

As a Virtualization Validation Engineer, you will be responsible for the end-to-end validation of large-scale, multi-node GPU clusters. You will focus on high-performance GPU Virtualization using QEMU and Cloud Hypervisor, ensuring that distributed workloads scale efficiently across multiple virtualized nodes. Your role is critical in validating the interconnect fabric and collective communication libraries (NCCL/RCCL) that power the world's most demanding AI and HPC applications.

San Francisco, Sunnyvale (Onsite)

What You'll Be Working On:
  • Multi-Node Scaling Validation: Design and execute large-scale validation tests across multi-node virtualized clusters to ensure linear scaling and stability of GPU workloads.
  • Interconnect & Fabric Testing: Validate high-speed interconnects (including NVLink, Infinity Fabric, InfiniBand, and RoCE) within virtualized environments to ensure low-latency, high-bandwidth communication.
  • Hypervisor & GPU Virtualization: Lead the validation of QEMU and Cloud Hypervisor with a focus on PCIe passthrough (VFIO), IOMMU, and direct device assignment for GPUs and high-speed NICs.
  • Collective Communication Benchmarking: Architect and run comprehensive test suites using nccl-tests and rccl-tests (e.g., AllReduce, AllGather) to verify performance across node boundaries.
  • Network Stack Validation: Validate SR-IOV and RDMA configurations to ensure that virtualized guests achieve near-bare-metal networking performance for distributed GPU tasks.
  • Automated Cluster Orchestration: Develop and maintain automation frameworks in Python or Go to dynamically provision, configure, and stress-test multi-node virtualized environments.
  • Performance Bottleneck Analysis: Perform deep-dive analysis of performance regressions in multi-node communication, identifying root causes across the guest OS, hypervisor, and physical fabric.

What You'll Bring to the Team:
  • Education & Experience: 5+ years of experience with a demonstrated ability to perform responsibilities competently and independently, plus a Bachelor's or Master's degree in Computer Science, Electrical Engineering, or a related technical field.
  • Virtualization Expertise: Proven experience with QEMU/KVM and Cloud Hypervisor in a production or research environment.
  • Distributed GPU Ecosystems: Deep familiarity with NVIDIA (CUDA/NCCL) and/or AMD (ROCm/RCCL) stacks in a multi-node context.
  • Networking Knowledge: Strong understanding of RDMA, RoCE, and InfiniBand protocols and their implementation in virtualized systems.
  • System Internals: Expert-level knowledge of Linux kernel internals, specifically PCIe topology, VFIO, and memory management (HugePages, IOMMU).
  • Automation & Scripting: Advanced proficiency in Python and/or Bash for automating complex cluster-wide test scenarios.

Bonus Points:
  • Experience with MNNVL (Multi-Node NVLink) or specialized AI fabric architectures.
  • Familiarity with hardware-level debugging tools and performance profilers (e.g., NVIDIA Nsight, AMD Omniperf).
  • Knowledge of containerized orchestration for GPUs (e.g., Kubernetes with specialized device plugins).

Benefits:
  • Competitive compensation and equity packages
  • Restricted Stock Units
  • Paid time off, paid holidays & leave of absence programs
  • Comprehensive health, dental & vision insurance
  • Employer contributions to HSA
  • Paid parental leave
  • Paid life insurance, short-term and long-term disability
  • Professional development & tuition reimbursement
  • Mental health & wellness support
  • Commuter benefits (parking & transit)
  • Cell phone stipend
  • 401(k) retirement plan with company match up to 4% of salary
  • Volunteer time off
  • Global travel insurance & emergency assistance
  • Daily meals allowance
  • Additional perks & programs specific to location

Compensation:

Compensation will be paid in the range of $172,500 - $210,000. Restricted Stock Units are included in all offers. Compensation to be determined by the applicant's education, experience, knowledge, skills, and abilities, as well as internal equity and alignment with market data.

Crusoe is an Equal Opportunity Employer. Employment decisions are made without regard to race, color, religion, disability, genetic information, pregnancy, citizenship, marital status, sex/gender, sexual preference/orientation, gender identity, age, veteran status, national origin, or any other status protected by law or regulation.
Employers have access to artificial intelligence language tools (“AI”) that help generate and enhance job descriptions and AI may have been used to create this description. The position description has been reviewed for accuracy and Dice believes it to correctly reflect the job opportunity.
  • Dice Id: 80183293
  • Position Id: a6c7db35abd4235a830aaebe1cbbcd42