Job Functions:
Own monitoring and performance tuning across HPC applications, the SLURM job scheduler, networking, storage, and hardware to optimize workload efficiency.
Perform root-cause analysis to develop sustainable automated solutions.
Troubleshoot alerts and errors, communicating effectively with external entities and internal counterparts.
Maintain communication with new and existing external vendors on technical infrastructure work.
Maintain communication with new and existing external business interfaces on day-to-day operations, connectivity planning, and exchange upgrades.
Oversee internal communication and cross-functional project management to plan exchange upgrades, hardware refreshes, and other improvements and rollouts.
Build out and support research, trading, and enterprise infrastructure.
Qualifications:
5+ years of HPC system administration/architecture experience, including RHEL/CentOS/Rocky Linux systems
Experience with job scheduling and resource management tools such as SLURM, Moab, or Torque
Knowledge of network storage systems such as DDN, IBM Spectrum Scale, NetApp, Weka, or Vast
Knowledge of parallel file systems such as Lustre or Spectrum Scale (GPFS)
Experience working with InfiniBand and high-speed Ethernet
Hands-on experience with configuration management tools such as xCAT, Ansible, Salt, or Terraform
Exposure to bare-metal provisioning, including DHCP, DNS, and PXE boot
Competence in Python, including the ability to write and modify scripts