Senior Staff Data Center Operations Engineer, GPU Hardware Architecture

San Francisco, CA, US • Posted 30+ days ago • Updated 2 hours ago

Full Time

On-site

USD $179,000.00 - 218,000.00 per year

Job Details

Skills

Conflict Resolution
Problem Solving
Energy
Manufacturing
Data Center Design
Strategist
Inventory
MI
Marketing Intelligence
Artificial Intelligence
End-user Training
Sourcing
Blueprint
PCI Express
Tier 3
Root Cause Analysis
Educate
Value At Risk
IT Management
Auditing
Roadmaps
InfiniBand
Mechanical Engineering
Python
Bash
Machine Learning (ML)
Reliability Analysis
Field Service
Workflow
Thermal Management
Management
Integrated Circuit
Fluid Mechanics
Computer Hardware
Systems Architecture
GPU
Cloud Computing
Electrical Engineering
Computer Engineering
Insurance
Life Insurance
Professional Development
Market Analysis
Law

Summary

Crusoe is on a mission to accelerate the abundance of energy and intelligence. As the only vertically integrated AI infrastructure company built from the ground up, we own and operate each layer of the stack - from electrons to tokens - to power the world's most ambitious AI workloads. When you join Crusoe, you join a team that is building the future, faster.

We're in the midst of the greatest industrial revolution of our time. The demand for AI compute is boundless, and power is a bottleneck. We're solving that - with an energy-first approach that makes AI infrastructure better for the world and faster for the people innovating with AI.

We're looking for problem-solving, opportunity-finding teammates with a sense of urgency, who believe in the scale of our ambition and thrive on a path not fully paved - people who want to grow their careers alongside a team of experts across energy, manufacturing, data center construction, and cloud services.

If you want to do the most meaningful work of your career, help our customers and partners advance their AI strategies, and be part of a high-performing team that believes in each other, come build with us at Crusoe.

The Mission

Crusoe is building the world's most climate-aligned AI infrastructure. As we scale toward unprecedented power densities and liquid-cooled architectures, the gap between "Data Center Design" and "Silicon Reality" must be bridged.

We are seeking a Senior Staff Data Center Operations Engineer, GPU Hardware Architecture to be the definitive technical authority on GPU platforms within the Data Center Engineering and Operations organization. Your mission is twofold: act as the primary technical consultant to our Data Center Engineering team to ensure future facilities are built for next-gen silicon, and provide the Operations team with the specialized tooling, SOPs, and predictive strategies needed to maintain peak cluster health.

The Strategic Bridge

For DC Engineering: You are the internal consultant. You translate upcoming GPU power/thermal roadmaps (NVIDIA/AMD) into design requirements for our next-generation facilities.
For Site Operations: You are the "Technical Enabler." You develop the diagnostic tools and technical SOPs that enable field technicians to resolve complex GPU issues with surgical accuracy.
For Sourcing: You are the "Technical Strategist." You define the technical sparing requirements and site-level inventory needs based on hardware failure telemetry.

Key Responsibilities

Engineering Education & Design Support: Provide deep-dive technical guidance to the Data Center Engineering team on upcoming silicon (e.g., NVIDIA Blackwell/Rubin, AMD MI350/400). Ensure future facility designs for power, cooling, and rack spacing are ready for per-chip power draw of 2000W and above.
Predictive Operations & Telemetry: Leverage AI/ML methodologies to analyze fleet-wide telemetry (power draws, thermal gradients, and error rates). You will lead the transition from reactive troubleshooting to predictive maintenance, identifying "pre-failure" patterns in HBM or NVLink components before they impact customer training runs.
Technical Sparing Architecture: Architect the site-level sparing strategy from a technical perspective. Use failure telemetry and MTBF data to define the "Critical Spares List" and stocking levels required at each site to meet cluster uptime targets, providing these requirements to Sourcing for execution.
Operational Tooling & SOPs: Build the "Operational Blueprint" for the field. Create precision SOPs for high-stakes GPU repairs (e.g., baseboard swaps, manifold maintenance) and develop diagnostic tooling that allows Site Ops to identify NVLink flapping, PCIe degradations, or thermal throttling.
Advanced Troubleshooting & RCA: Act as the Tier-3 escalation point for the most complex hardware failures in the production environment. Lead Root Cause Analysis (RCA) on systemic issues that span the boundary between hardware and facility environmental factors.
Silicon Roadmap Authority: Maintain a 24-month forward-looking view of NVIDIA and AMD architectures. Educate internal stakeholders on how transitions to HBM4, faster interconnects, and liquid cooling will impact Crusoe's physical infrastructure.
Vendor & VAR Technical Lead: Support the technical relationship with OEMs and VARs. Audit their hardware builds, review their technical bulletins, and ensure their hardware roadmaps align with Crusoe's operational and engineering standards.

Technical Requirements

Silicon & Fabric Mastery: Expert-level knowledge of NVIDIA (Hopper/Blackwell/Rubin) and AMD (Instinct) architectures. Mastery of the physical and logical layers of NVLink, NVSwitch, and InfiniBand.
Infrastructure Bridge-Building: Ability to translate "Silicon Data Sheets" into "Mechanical Engineering Requirements." You can explain how a GPU's specific heat-load profile affects CDU sizing and secondary loop design.
Data-Driven Diagnostics: Proficient in Python, Go, or Bash to build telemetry and health-check tools (utilizing DCGM and ROCm). Experience using large datasets or basic ML frameworks to build "Smart Monitoring" that filters critical health signals from noise.
Operational Reliability Analysis: Experience using failure telemetry to inform site-level sparing requirements and field-service workflows.
Thermal Management: Deep understanding of the operational realities of Direct-to-Chip (D2C) cooling, including fluid dynamics, pressure-drop curves, and the lifecycle of dripless couplings.

Qualifications

10+ years in Hardware Engineering, Systems Architecture, or Data Center Infrastructure.
The "Consultant" Mindset: Proven track record of educating and influencing cross-functional teams (specifically Engineering and Operations).
GPU Authority: You have managed or architected GPU clusters at scale (thousands of nodes) at a hyperscaler, a GPU-specialized cloud, or a major silicon vendor.

Education: B.S. or M.S. in Electrical Engineering, Computer Engineering, or a related technical field.

Benefits

Competitive compensation
Restricted Stock Units
Paid time off & paid holidays
Comprehensive health, dental & vision insurance
Employer contributions to HSA account
Paid parental leave
Paid life insurance, short-term and long-term disability
Professional development & tuition reimbursement
Mental health & wellness support
Commuter benefits (parking & transit)
Cell phone stipend
401(k) retirement plan with company match up to 4% of salary
Volunteer time off

Compensation Range

Compensation will be paid in the range of $179,000 - $218,000 + Bonus. Restricted Stock Units are included in all offers. Compensation will be determined by the applicant's knowledge, education, and abilities, as well as internal equity and alignment with market data.

Crusoe is an Equal Opportunity Employer. Employment decisions are made without regard to race, color, religion, disability, genetic information, pregnancy, citizenship, marital status, sex/gender, sexual preference/orientation, gender identity, age, veteran status, national origin, or any other status protected by law or regulation.

Employers have access to artificial intelligence language tools (“AI”) that help generate and enhance job descriptions, and AI may have been used to create this description. The position description has been reviewed for accuracy and Dice believes it to correctly reflect the job opportunity.

Dice Id: 80183293
Position Id: 6309a7d02dedc8f5d23e80a7b2affc80