HPC Engineer

Overview

On Site

USD 90,000.00 - 100,000.00 per year

Full Time

Skills

Spring Framework

High Performance Computing

GRID

Computer Networking

Firmware

Scalability

ISO/IEC 20000

Application Support

Incident Management

IT Management

Identity Management

Provisioning

Storage

Hardware Development

HPC

Machine Learning (ML)

Research

Technical Support

Artificial Intelligence

Training

Service Management

Knowledge Transfer

Continuous Improvement

KPI

Vendor Management

Procurement

Computer Hardware

Information Technology

Computer Science

GPU Computing

FOCUS

Performance Tuning

Parallel Computing

Programming Languages

CUDA

Computer Architecture

Algorithms

Debugging

GPU

System Administration

Systems Engineering

Change Request Management

Communication

Collaboration

Root Cause Analysis

IT Service Management

Documentation

Management

Business Process

Job Details

HPC (High Performance Computing) Engineer

Full-time position with our client in Cold Spring Harbor, NY

We are a forward-thinking technology organization seeking an experienced HPC Engineer to join our team. The ideal candidate will optimize and maintain our NVIDIA GPU-based high-performance computing infrastructure while collaborating with our technical teams to maximize computational efficiency.

Position Responsibilities

Cluster Implementation and Management:

Administers the CSHL HPC cluster and storage system.
Installs, optimizes, and maintains HPC software (EasyBuild, Anaconda).
Administers HPC workload managers (Slurm, Grid Engine).
Collaborates with cross-functional teams to ensure seamless integration of hardware, software, and networking components.

Optimizes system performance, scalability, and reliability. Tunes GPU performance and firmware to improve the efficiency and scalability of decentralized AI inference tasks, as well as general performance, processing, and utilization.

Monitors cluster performance, identifies bottlenecks, and implements performance enhancements.
Adheres to best-practice models to improve client services, including ISO 20000 practices for service and application support, problem and incident management, server technology management, identity and access management, and management of continuous improvement.
Provides support and/or services for provisioning, installation/configuration, and maintenance of IT server systems hardware, software, and related infrastructure in alignment with organizational goals and requirements. Helps the CSHL community adhere to configuration standards.
Participates in new initiatives such as cluster expansion and storage usage efficiency.
Manages the full lifecycle of hardware development, from conception through deployment and maintenance.

User Support and/or Services:

Creates and updates end-user HPC documentation.
Works closely with scientists to optimize computational workloads, data movement, and parallel processing. Trains scientists on using the cluster effectively for AI workloads.
Optimizes, deploys, and maintains robust software to support high-performance AI/ML computations and parallel processing.

Collaborates with scientists and AI/ML engineers to tailor solutions that meet the specific needs of their research.
Provides technical support, troubleshoots issues, and addresses user queries related to the cluster.
Assists in developing best practices for AI model training and deployment.

Service Management:

As a key member of the IT Systems Engineering team, provides efficient and effective resolution of incidents and problems with a service-centric approach, ensuring the stability and performance of CSHL services.
Documents systems configurations, processes, and procedures to ensure reproducible, stable systems that can be efficiently supported by CSHL IT teams. Works with other CSHL IT teams to ensure knowledge transfer, resulting in effective resolution of problems.
Contributes to the continual improvement of issue and incident management. Collaborates with other members of the Systems Engineering team to establish and monitor key performance indicators (KPIs) to measure systems and identify areas for improvement.
Maintains current knowledge of key technology trends, proactively preparing to assist the community with recommendations.
Communicates with and builds strong collaborative relationships with key stakeholders.

Vendor Management:

Coordinates with vendors and/or other CSHL teams to aid the procurement of necessary hardware, software, and services, ensuring cost-effective solutions that align with business needs.

Position Requirements

EDUCATION:

Bachelor's degree in information technology, computer science, or a related field (or equivalent combination of education and work experience).

EXPERIENCE:

2+ years of experience in GPU computing, with a focus on performance optimization and parallel programming.
Proficiency in GPU programming languages such as CUDA.
Strong understanding of computer architecture, memory systems and parallel algorithms.
Experience with profiling and debugging tools for GPU applications, such as NVIDIA Nsight, desired.
Experience in IT system administration and in IT server infrastructure operations.
IT Systems Engineering experience, including incident, problem, and request management processes.
Strong verbal and written communication skills, including ability to communicate, motivate, and collaborate effectively with diverse groups of people.
Ability to troubleshoot and support/drive issues to resolution, including root cause analysis.

SKILLS:

Motivated, friendly, committed, and energetic self-starter, dedicated to providing high quality and responsive IT services.
Excellent organization, documentation, time management and prioritization skills to manage multiple projects, locations, and technology needs.
Ability to maintain problem oversight and manage multiple simultaneous project tasks, prioritizing demands across functional work areas.
Ability to establish a practical working knowledge of CSHL business processes, interacting with key users to recommend solutions that best meet the strategic needs.
Has a mindset geared toward improving standards, simplifying, enhancing functionality, and/or transitioning to solutions that improve supportability.

Salary - 90-100K


Employers have access to artificial intelligence language tools (“AI”) that help generate and enhance job descriptions and AI may have been used to create this description. The position description has been reviewed for accuracy and Dice believes it to correctly reflect the job opportunity.
