Role- HPC consultant
Location- Fremont, CA/ Tualatin, OR
Contract
HPC Cluster & Scheduler Management
• Design, configure, tune, and optimize SLURM partitions, queues, QoS, and scheduling policies to maximize cluster utilization and workload efficiency.
• Perform in-depth analysis of job scheduling behavior, bottlenecks, and resource contention.
• Troubleshoot job failures, performance degradation, and scheduler-related issues in production HPC environments.
• Implement fair-share, backfill, reservations, and policy-driven scheduling as required.
Storage Benchmarking & Procurement Support
• Lead HPC storage performance benchmarking using industry-standard tools (e.g., IOR, FIO, MDTest, IOzone).
• Analyze I/O patterns of HPC workloads and map them to appropriate storage architectures (parallel file systems, NVMe, Lustre, Spectrum Scale, etc.).
• Provide technical input for storage selection and procurement, including performance expectations, sizing, and cost-performance tradeoffs.
• Collaborate with vendors and internal teams during POCs and performance validation exercises.
HPC Application Build & Optimization
• Build, install, configure, and maintain HPC applications, compilers, libraries, and scientific software stacks.
• Optimize application performance using MPI, OpenMP, GPU acceleration (where applicable), and tuned math libraries.
• Support multiple compiler toolchains (GCC, Intel, LLVM, NVIDIA HPC SDK, etc.).
• Implement and manage environment modules (Lmod) or similar software management frameworks.
System Performance & Operations
• Conduct system-level performance tuning across compute, memory, network, and storage layers.
• Diagnose node-level issues involving CPU, GPU, interconnects (InfiniBand/Ethernet), and OS configurations.
• Create operational runbooks, performance baselines, and troubleshooting documentation.
• Support cluster upgrades, expansions, and hardware refresh activities.
Collaboration & Delivery
• Work closely with application owners, researchers, and infrastructure teams to meet aggressive delivery timelines.
• Translate workload requirements into practical HPC configurations and optimizations.
• Provide clear technical guidance and recommendations to leadership and stakeholders.
Required Skills & Experience
Core HPC Skills
• 8–12+ years of hands-on HPC engineering experience in production environments.
• Strong expertise with SLURM (configuration, tuning, troubleshooting).
• Solid understanding of Linux systems (RHEL/CentOS/Rocky/Alma preferred).
• Deep knowledge of HPC storage systems and I/O performance analysis.
• Proven experience building and optimizing HPC applications and libraries.
Technical Proficiency
• MPI implementations (Open MPI, MPICH), OpenMP
• Compilers and toolchains (GCC, Intel, NVIDIA HPC SDK)
• Performance tools (perf, vtune, nvprof/nsys, IB diagnostics)
• Environment modules (Lmod), package managers (Spack preferred)
• Bash/Python scripting for automation and diagnostics
Nice to Have
• Experience with GPU-based HPC workloads (NVIDIA CUDA, ROCm).
• Exposure to cloud-based HPC (Azure, AWS, Google Cloud Platform).
• Familiarity with parallel file systems such as Lustre or IBM Spectrum Scale.
• Vendor engagement experience for HPC hardware/storage evaluations.