Location: New York City, New York, United States (Remote with Travel)
Long-term
Role Summary
We are seeking a senior Data Center Architect to lead the planning and build-out of AI-class data centers and rack-scale clusters purpose-built for OpenAI-scale workloads. You will own end-to-end delivery: site selection, power and cooling strategy, rack-level designs, Bill of Materials (BoM), vendor alignment, commissioning, and operational handoff. This role emphasizes high-density GPU infrastructure (e.g., Blackwell/Hopper/MI300X), liquid-cooling ecosystems, InfiniBand/Ethernet fabrics, and a disciplined hardware plan that scales our own AI business from pilot racks to multi-MW campuses.
Key Responsibilities
A. AI Factory Program & Site Development
Lead greenfield/brownfield planning for AI data centers (10–100+ MW): site selection, utility engagement, interconnect studies, and permitting/ESG baselines.
Define program phases (pilot → pod → hall → campus) with capacity ramps, risk registers, build schedules, and acceptance criteria.
B. Rack-Scale Compute & Fabrics (NVL-class)
Architect rack-scale GPU systems (e.g., NVL72/DGX SuperPOD patterns), including NVLink/NVSwitch inside the rack and NDR/800G fabrics across racks.
Produce detailed LLDs: rack elevations, leaf/spine port maps, optics (MPO-8/12), cable trays, and labeling standards; validate non-blocking designs.
Right-size ToR/aggregation, management/OOB networks, and telemetry for tens of thousands of GPUs with deterministic latency and throughput.
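Validating a non-blocking leaf/spine design ultimately reduces to port arithmetic: each leaf's uplink capacity must match its GPU-facing downlink capacity. A minimal sketch of that check, using hypothetical port counts and speeds (a 64-port 800G leaf with one 800G rail per GPU is an assumption, not a mandated configuration):

```python
# Non-blocking leaf check: uplink capacity must cover downlink capacity.
# Port counts and per-GPU rail assumptions below are hypothetical examples.

LEAF_PORTS = 64            # e.g., a 64x800G leaf switch (assumed)
DOWNLINKS = 32             # ports facing GPU NICs
UPLINKS = LEAF_PORTS - DOWNLINKS

# 1:1 oversubscription or better qualifies as non-blocking.
non_blocking = UPLINKS >= DOWNLINKS
gpus_per_leaf = DOWNLINKS  # assuming one 800G rail per GPU

print(f"uplinks={UPLINKS}, downlinks={DOWNLINKS}, non-blocking={non_blocking}")
```

The same arithmetic scales up through the spine tier: total spine-facing bandwidth per pod must equal total GPU-facing bandwidth for the design to stay non-blocking end to end.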
C. Power & Cooling Engineering
Engineer 50–150 kW-per-rack envelopes using A/B power, 48 V power shelves, busway, intelligent PDUs, and UPS/BESS/generator stacks for ride-through and extended outages.
Design liquid-cooling ecosystems: direct-to-chip cold plates, rack/floor CDUs, manifolds, rear-door heat exchangers, leak detection, water treatment, and room-neutral operation.
Model energy KPIs (PUE/WUE, tokens per watt) and heat-reuse options; embed safety, redundancy (N, N+1, 2N), and serviceability in all designs.
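The power and KPI modeling above starts with back-of-envelope arithmetic: hall load from rack count and density, BESS ride-through from usable energy over load, and PUE from facility versus IT power. A sketch under entirely hypothetical figures (none of these numbers are a design target):

```python
# Back-of-envelope power and energy-KPI arithmetic for one GPU hall.
# Every figure below is a hypothetical placeholder, not a design spec.

RACKS = 100
KW_PER_RACK = 120                  # within the 50-150 kW/rack envelope
it_load_kw = RACKS * KW_PER_RACK   # 12,000 kW of IT load

# UPS/BESS ride-through: usable battery energy divided by load.
bess_kwh = 4_000                                 # assumed usable energy
ride_through_min = bess_kwh / it_load_kw * 60    # minutes on battery

# Energy KPIs: PUE and tokens per watt.
facility_load_kw = 13_800          # IT load plus cooling/distribution losses (assumed)
pue = facility_load_kw / it_load_kw
tokens_per_s = 2.0e9               # aggregate token throughput (assumed)
tokens_per_watt = tokens_per_s / (facility_load_kw * 1_000)

print(f"ride-through={ride_through_min:.1f} min, PUE={pue:.2f}, "
      f"tokens/W={tokens_per_watt:.1f}")
```

Real models layer on redundancy (N+1/2N derating), generator start windows, and WUE, but the ratios above are the starting point for every capacity conversation.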
D. Storage & Data Pipelines
Define parallel storage tiers (NVMe-backed, IB/Ethernet multi-rail) for training/feature IO; integrate object/file services for datasets, checkpoints, and results.
Size cache/staging and throughput targets; implement namespace-growth plans and data-lifecycle policies aligned to model-training cadence.
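Throughput targets for the checkpoint path fall out of simple sizing arithmetic: checkpoint size divided by the write window you can tolerate stalling training. A sketch with hypothetical model and window figures (the 1T-parameter, bf16, optimizer-state-excluded assumptions are illustrative only):

```python
# Checkpoint-bandwidth sizing sketch; all figures are hypothetical.

model_params = 1.0e12          # 1T parameters (assumed)
bytes_per_param = 2            # bf16 weights (assumed; optimizer state excluded)
checkpoint_gb = model_params * bytes_per_param / 1e9   # checkpoint size in GB

write_window_s = 60            # tolerable time to drain a checkpoint (assumed)
required_gbps = checkpoint_gb / write_window_s         # aggregate GB/s needed

print(f"checkpoint={checkpoint_gb:.0f} GB, required write bw={required_gbps:.1f} GB/s")
```

Dividing the aggregate requirement across multi-rail NICs and NVMe appliances then yields per-appliance throughput targets and informs cache/staging capacity.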
E. Hardware Plan for Our AI Business
Create a multi-year hardware roadmap: accelerators (GPU generations), CPUs, memory/HBM, NICs, optics, CDUs, PDUs, cabinets, and spares.
Develop BoMs and vendor lots per phase; negotiate lead times and logistics; maintain cost/performance models and refresh triggers.
Standardize rack templates (Training, Inference, Storage, Fabric, Management) to minimize change orders and compress deployment timelines.
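One common way to express a refresh trigger in the cost/performance model is capex dollars per delivered PFLOP, refreshing when a new generation beats the incumbent by a set margin. A toy sketch, with prices, performance figures, and the 70% threshold all hypothetical:

```python
# Toy cost/performance comparison used as a hardware-refresh trigger.
# Prices, PFLOP figures, and the threshold are hypothetical placeholders.

def cost_per_pflop(unit_cost_usd: float, pflops: float) -> float:
    """Capex dollars per delivered PFLOP of accelerator performance."""
    return unit_cost_usd / pflops

current = cost_per_pflop(unit_cost_usd=30_000, pflops=2.0)
next_gen = cost_per_pflop(unit_cost_usd=40_000, pflops=5.0)

REFRESH_THRESHOLD = 0.70  # refresh when next-gen $/PFLOP < 70% of current
refresh = next_gen < REFRESH_THRESHOLD * current

print(f"current={current:.0f} $/PFLOP, next={next_gen:.0f} $/PFLOP, refresh={refresh}")
```

Production models would fold in power, cooling, lead times, and residual value, but a single $/PFLOP ratio is a useful first-order trigger for roadmap reviews.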
F. Security, Safety & Compliance
Implement zoning, micro-segmentation, secure OOB, PAM/MFA, and encryption at rest and in flight; ensure physical security and safety systems for liquid-cooled and high-power environments.
Align designs with ISO 27001, SOC 2, OSHA/NFPA, electrical codes, and environmental reporting; document commissioning and emergency runbooks.
G. Governance & Vendor Management
Own SOWs, estimates, RAID logs, executive reporting, and change control across OEMs (compute, network, storage), cooling/power vendors, and colocation providers.
Mentor architects/engineers; codify standards, templates, and automation (Ansible/Terraform) for repeatable delivery.
Data Center Hardware Plan Scope
Compute & Interconnect: High-density GPU racks (NVL-class, 72 GPUs per rack) with NVLink/NVSwitch backplanes, dual-socket CPU nodes (Sapphire Rapids/Grace), and high-speed fabrics (400G InfiniBand or 800G Ethernet leaf-spine topology).
Power, Cooling & Physical Infrastructure: 48 V power shelves, redundant PDUs, UPS/BESS, and generators; liquid-cooling ecosystems (direct-to-chip cold plates, CDUs, rear-door heat exchangers); high-flow cabinets with aisle containment, leak detection, and DCIM telemetry.
Storage & Connectivity: Parallel NVMe storage appliances, object/file tiers, and multi-rail connectivity; structured cabling with MPO trunks and 400/800G optics; standardized labeling/testing for reliability and scalability.
Required Qualifications
10+ years in data center architecture/engineering; 5+ years building high-density GPU clusters and commissioning liquid-cooled racks.
Hands-on experience with NVL-class rack systems, NDR/800G fabrics, optical design (MPO), and parallel storage.
Power & cooling expertise: rack densities of 50–150 kW, A/B power, 48 V distribution, UPS/BESS/generator design, and liquid ecosystems (CDU/DLC/RDHx).
Strong BoM/vendor management; construction interface and commissioning leadership; safety and compliance literacy.
Scripting/IaC (Ansible, Terraform, Python) and rigorous documentation (HLD/LLD, runbooks, as-builts).