Research Computing GPU Systems Engineer

Remote in Stanford, CA, US • Posted 15 hours ago • Updated 2 hours ago

Full Time

On-site

USD $190,577.00 - 200,000.00 per year

Job Details

Skills

IT Operations
Computer Cluster Management
Leadership
Biology
Physics
Job Scheduling
Resource Allocation
Microsoft Windows
Incident Management
Root Cause Analysis
Storage Management
Benchmarking
Reporting
Performance Tuning
Remote Direct Memory Access
Deep Learning
JAX
Code Optimization
Optimization
Technical Support
Documentation
Workflow
Team Leadership
Mentorship
Strategic Planning
Vendor Relationships
Computer Hardware
Data Science
Computer Science
System Administration
Management
CUDA
Linux Administration
Red Hat Enterprise Linux
Ubuntu
Computer Networking
InfiniBand
File Systems
IBM GPFS
Scripting
Python
Bash
Docker
Kubernetes
Artificial Intelligence
Machine Learning (ML)
PyTorch
TensorFlow
Grafana
Open Source
HPC
GPU Computing
Machine Learning Operations (ML Ops)
Virtualization
MIG
IT Management
Creative Problem Solving
Communication
Collaboration
Adaptability
GPU
Research
Professional Development
System Integration Testing
Writing
SAFE
Training
Policies and Procedures
Budget
Inventory
Law
Recruiting
Human Resources

Summary

About the Role

Stanford Research Computing seeks an exceptional GPU Cluster Lead Engineer to oversee technical operations, optimization, and strategic development of Marlowe, Stanford's NVIDIA SuperPOD. This role combines deep technical expertise in GPU computing, large-scale cluster management, and leadership in supporting a diverse research community. You will serve as the technical authority on GPU infrastructure, driving system performance and reliability while enabling groundbreaking research in AI/ML, computational biology, physics, and beyond.

Key Responsibilities

System Operations & Management

Lead day-to-day operations of the GPU Cluster, ensuring optimal uptime and performance.
Architect monitoring, alerting, and observability solutions using Prometheus, Grafana, DCGM, and Base Command Manager.
Manage job scheduling and resource allocation using Slurm, implementing advanced GPU partitioning and configurations.
Coordinate maintenance windows, system upgrades, and capacity expansions; lead incident response and root cause analyses.
Manage system storage, including optimization, benchmarking, and observability reporting.

Performance Optimization & Engineering

Design performance tuning strategies for GPU utilization, job throughput, and system efficiency.
Optimize NVIDIA GPU fabric configurations including NVLink, NVSwitch, and InfiniBand RDMA networking.
Develop containerization strategies using NVIDIA N Docker, and Singularity/Apptainer.
Engineer solutions for deep learning frameworks (PyTorch, TensorFlow, JAX) and CUDA application optimization.
Benchmark system performance and collaborate with NVIDIA on optimization programs.

User Support & Research Enablement

Serve as primary technical consultant for researchers using GPU-accelerated computing.
Develop documentation, best practices guides, and training materials; deliver workshops on GPU computing workflows.
Profile and optimize user workloads, scaling applications from single-GPU to multi-node distributed training.

Team Leadership & Strategy

Mentor junior engineers and contribute to strategic planning for GPU infrastructure expansion.
Evaluate emerging GPU technologies and manage vendor relationships with NVIDIA and hardware suppliers.
Represent SRC in ongoing interactions with the Stanford Data Sciences group on AI/ML infrastructure; participate in on-call rotation.

Education & Experience

Bachelor's degree in Computer Science, Engineering, or related field and ten years of relevant experience or a combination of education and relevant experience.
5+ years in HPC systems administration or research computing; 3+ years managing GPU clusters (NVIDIA A100/H100)

Required Qualifications

Expert knowledge of NVIDIA GPU architecture, CUDA, and GPU computing principles (NVLink, MIG, GPUDirect)
Advanced Linux administration (RHEL, Ubuntu); expertise with Slurm job scheduler
Experience with high-performance networking (InfiniBand, RoCE) and parallel filesystems (Lustre, GPFS)
Strong scripting (Python, Bash) and containerization experience (Docker, Singularity, Kubernetes)
Familiarity with AI/ML frameworks (PyTorch, TensorFlow) and distributed training techniques
Experience with monitoring tools (Prometheus, Grafana) and NVIDIA DCGM

Preferred Qualifications

Experience with Base Command Manager or Bright Cluster Manager
Background in academic research computing or national lab environments
Contributions to open-source HPC or GPU computing projects
Knowledge of MLOps practices and GPU virtualization (vGPU, MIG)

Key Competencies

Technical leadership
Creative problem-solving
Excellent communication with technical and non-technical audiences
Strong collaboration skills
Service-oriented mindset
Adaptability to rapidly evolving technology

What We Offer

Work with cutting-edge NVIDIA GPU technology enabling groundbreaking research
Professional development opportunities
Collaborative environment with talented engineers and researchers
Comprehensive Stanford benefits package including health, dental, retirement, and education benefits
Flexible work arrangements

Physical Requirements*:

Constantly perform desk-based computer tasks.
Frequently sit, grasp lightly/fine manipulation.
Occasionally stand/walk, write by hand.
Rarely use a telephone, lift/carry/push/pull objects that weigh up to 10 pounds.

* Consistent with its obligations under the law, the University will provide reasonable accommodations to applicants and employees with disabilities. Applicants requiring a reasonable accommodation for any part of the application or hiring process should contact Stanford University Human Resources by submitting a contact form.

Working Conditions:

May work extended hours, evenings, and weekends.

Work Standards:

Interpersonal Skills: Demonstrates the ability to work well with Stanford colleagues and clients and with external organizations.
Promote Culture of Safety: Demonstrates commitment to personal responsibility and value for safety; communicates safety concerns; uses and promotes safe behaviors based on training and lessons learned.
Subject to and expected to comply with all applicable University policies and procedures, including but not limited to the personnel policies and other policies found in Stanford's Administrative Guide.

The expected pay range for this position is $190,577 to $200,000 per annum.

Stanford University provides pay ranges representing its good faith estimate of the salary or hourly wage the university reasonably expects to pay for a position upon hire. The pay offered to a selected candidate will be determined based on factors such as (but not limited to) the scope and responsibilities of the position, the qualifications of the selected candidate, departmental budget availability, internal equity, geographic location and external market pay for comparable jobs.

At Stanford University, base pay represents only one aspect of the comprehensive rewards package. The Cardinal at Work website provides detailed information on Stanford's extensive range of benefits and rewards offered to employees. Specifics about the rewards package for this position may be discussed during the hiring process.

The job duties listed are typical examples of work performed by positions in this job classification and are not designed to contain or be interpreted as a comprehensive inventory of all duties, tasks, and responsibilities. Specific duties and responsibilities may vary depending on department or program needs without changing the general nature and scope of the job or level of responsibility. Employees may also perform other duties as assigned.

Employers have access to artificial intelligence language tools (“AI”) that help generate and enhance job descriptions and AI may have been used to create this description. The position description has been reviewed for accuracy and Dice believes it to correctly reflect the job opportunity.
Dice Id: RTX169eef
Position Id: ae13b23b62baad29d54d910c16688c82
