HPC Engineer

Overview

On Site
USD 90,000.00 - 100,000.00 per year
Full Time

Skills

Spring Framework
High Performance Computing
GRID
Computer Networking
Firmware
Scalability
ISO/IEC 20000
Application Support
Incident Management
IT Management
Identity Management
Provisioning
Storage
Hardware Development
HPC
Machine Learning (ML)
Research
Technical Support
Artificial Intelligence
Training
Service Management
Knowledge Transfer
Continuous Improvement
KPI
Vendor Management
Procurement
Computer Hardware
Information Technology
Computer Science
GPU Computing
FOCUS
Performance Tuning
Parallel Computing
Programming Languages
CUDA
Computer Architecture
Algorithms
Debugging
GPU
System Administration
Systems Engineering
Change Request Management
Communication
Collaboration
Root Cause Analysis
IT Service Management
Documentation
Management
Business Process

Job Details

HPC (High Performance Computing) Engineer

FT Position with our client in Cold Spring Harbor, NY

We are a forward-thinking technology organization seeking an experienced HPC Engineer to join our team. The ideal candidate will optimize and maintain our NVIDIA GPU-based high-performance computing infrastructure while collaborating with our technical teams to maximize computational efficiency.

Position Responsibilities

Cluster Implementation and Management:
  • Administration of the CSHL HPC cluster and storage system.
  • Optimizes, installs, and maintains the HPC software (EasyBuild, Anaconda).
  • Administration of HPC workload managers (Slurm, Grid Engine).
  • Collaborates with cross-functional teams to ensure seamless integration of hardware, software, and networking components.
  • Optimizes system performance, scalability, and reliability. Optimizes GPU performance and firmware to improve the efficiency and scalability of decentralized AI inference tasks, as well as general performance, processing, and utilization.
  • Monitors cluster performance, identifies bottlenecks, and implements performance enhancements.
  • Adheres to best practice models to improve client services including ISO 20000 practices for service and application support, problem and incident management, server technology management, identity and access management, and management of continuous improvement.
  • Provides support and/or services for provisioning, installation/configuration, and maintenance of IT server systems hardware, software, and related infrastructure in alignment with organizational goals and requirements. Supports the CSHL community in adhering to configuration standards.
  • Participates in new initiatives such as cluster expansion and storage usage efficiency.
  • Manages the full lifecycle of hardware development, from conception through deployment and maintenance.

User Support and/or Services:
  • Creates and updates end-user HPC documentation.
  • Works closely with scientists to optimize computational workloads, data movement, and parallel processing. Trains scientists on using the cluster effectively for AI workloads.
  • Optimizes, deploys, and maintains robust software to support high-performance AI/ML computations and parallel processing.
  • Collaborates with scientists and AI/ML engineers to tailor solutions that meet the specific needs of their research.
  • Provides technical support, troubleshoots issues, and addresses user queries related to the cluster.
  • Assists in developing best practices for AI model training and deployment.

Service Management:
  • As a key member of the IT Systems Engineering team, provides efficient and effective resolution of incidents and problems with a service-centric approach, ensuring the stability and performance of CSHL services.
  • Documents systems configurations, processes, and procedures to ensure reproducible, stable systems that can be efficiently supported by CSHL IT teams. Works with other CSHL IT teams to ensure knowledge transfer resulting in effective resolution of problems.
  • Contributes to the continual improvement of effective management of issues and incidents. Collaborates with other members of the Systems Engineering team to establish and monitor key performance indicators (KPIs) to measure systems and identify areas for improvement.
  • Maintains current knowledge of key technology trends, proactively preparing to assist the community with recommendations.
  • Communicates with and builds strong collaborative relationships with key stakeholders.

Vendor Management:
  • Coordinates with vendors and/or other CSHL teams to aid the procurement of necessary hardware, software, and services, ensuring cost-effective solutions that align with business needs.


Position Requirements

EDUCATION:
  • Bachelor's degree in information technology, computer science, or a related field (or equivalent combination of education and work experience).

EXPERIENCE:
  • 2+ years of experience in GPU computing, with a focus on performance optimization and parallel programming.
  • Proficiency in GPU programming languages such as CUDA.
  • Strong understanding of computer architecture, memory systems and parallel algorithms.
  • Experience with profiling and debugging tools for GPU applications, such as NVIDIA Nsight, desired.
  • Experience in IT system administration and in IT server infrastructure operations.
  • IT Systems Engineering experience, including incident, problem, and request management processes.
  • Strong verbal and written communication skills, including ability to communicate, motivate, and collaborate effectively with diverse groups of people.
  • Ability to troubleshoot and support/drive issues to resolution, including root cause analysis.

SKILLS:
  • Motivated, friendly, committed, and energetic self-starter, dedicated to providing high quality and responsive IT services.
  • Excellent organization, documentation, time management and prioritization skills to manage multiple projects, locations, and technology needs.
  • Ability to maintain problem oversight and manage multiple simultaneous project tasks, prioritizing demands across functional work areas.
  • Ability to establish a practical working knowledge of CSHL business processes, interacting with key users to recommend solutions that best meet strategic needs.
  • A mindset geared toward improving standards, simplifying, enhancing functionality, and/or transitioning to solutions that improve supportability.

Salary - 90-100K
