High-Performance Computing System Administrator

Overview

On Site
Full Time

Skills

Backup
Documentation
Firmware
Screening
Testing
FOCUS
Software Administration
High Performance Computing
Science
Computational Science
Computer Hardware
Resource Management
Staff Management
Artificial Intelligence
Machine Learning (ML)
Data Centers
Mergers and Acquisitions
HPC
Provisioning
Management
InfiniBand
Data Storage
File Systems
IBM GPFS
Server Hardware
Recruiting
Law
Accessibility
Bash
Scripting Language
Linux Administration
Computer Networking
Writing
Attention To Detail
Storage
Research

Job Details

Essential Duties

Configure, deploy, and support HPC clusters used for university research.

Install, administer, and maintain hardware, system software, networking, accounts, and security measures to ensure performance, stability, and security.

Troubleshoot and fix issues with HPC hardware.

Deploy and support large-scale data storage and backup for critical research data.

Diagnose and correct system issues, whether they affect correct operation or performance.

Restore system integrity as quickly as possible following an outage to minimize downtime.

Manage end-user accounts.

Triage and solve user-submitted tickets related to HPC infrastructure.

Track system health and resource usage using monitoring software, and respond to issues.

Develop and maintain documentation for team members and occasionally for end users.

Research developments in HPC architectures and new technologies, processes, and methodologies.

Update and patch system software and firmware as needed to maintain performance and security.

Participate in determination of specifications for new systems, and tailor these to meet research needs.

Perform on-site installations and maintenance at data centers.

Apply technical expertise to identify and resolve system deficiencies.

Provide system services and analyze system performance for stakeholders and intended end users.

Perform other duties as assigned.

Required Education and Experience

Bachelor's Degree in a related field and a minimum of four years of related work experience or an equivalent combination of education and experience.

Background Check Requirements

All candidates for employment will be subject to pre-employment background screening for this position, which may include motor vehicle, DOT certification, drug testing and credit checks based on the position description and job requirements. All offers are contingent upon the successful completion of the background check. For additional information on the background check requirements and process visit "Learn about background checks" under the Applicant Support Resources section of Careers on the It's Your Yale website.

Position Focus:

The Yale Center for Research Computing (YCRC) seeks a High-Performance Computing System Administrator to join the center's Engineering team to provide hardware and software administration for a growing number of high-performance computing (HPC) clusters used in faculty research. The center is a computational core facility under the Office of the Provost created to support the advanced computing needs of the research community. The YCRC provides support that spans the Yale School of Medicine and Faculty of Arts & Sciences and encompasses Yale's HPC clusters, multiple petabytes of high-performance storage, and technologies for computational science and the analysis, sharing, and management of large-scale research data.

The successful candidate will support the infrastructure behind all of the above, including hardware, system and resource-management software, networking, storage, monitoring, and security measures. This is a highly collaborative effort, so frequent interaction with other system administrators, research-support staff, management, vendors, and researchers is a regular part of the role. The successful candidate will also participate in designing, recommending, and vetting architectures, specifications, and configurations of new systems, especially those using computational accelerators such as Graphics Processing Units (GPUs) to support Artificial Intelligence (AI) and Machine Learning (ML). To support this, the candidate will research developments in HPC architectures and new technologies, processes, and methodologies, especially those involving accelerators. This position also involves on-site maintenance at the data centers where the equipment is located. (Currently, the equipment is in two data centers, one in Holyoke, MA and the other in West Haven, CT.)

Preferred Education, Experience and Skills:

Experience with the following:

  • HPC clusters, preferably including their administration
  • Computational accelerators such as GPUs
  • Cluster provisioning and management tools
  • Batch schedulers
  • Technology in a research environment
  • High-speed networking, e.g., InfiniBand
  • Large storage systems and parallel file systems such as GPFS and Lustre
  • Server hardware component replacement
  • Working in a data-center or server-room environment

Posting Disclaimer

The intent of this job description is to provide a representative summary of the essential functions that will be required of the position and should not be construed as a declaration of specific duties and responsibilities of the particular position. Employees will be assigned specific job-related duties through their hiring departments.

EEO Statement:

The University is committed to basing judgments concerning the admission, education, and employment of individuals upon their qualifications and abilities and seeks to attract to its faculty, staff, and student body qualified persons from a broad range of backgrounds and perspectives. In accordance with this policy and as delineated by federal and Connecticut law, Yale does not discriminate in admissions, educational programs, or employment against any individual on account of that individual's sex, sexual orientation, gender identity or expression, race, color, national or ethnic origin, religion, age, disability, status as a special disabled veteran, veteran of the Vietnam era or other covered veteran.

Inquiries concerning Yale's Policy Against Discrimination and Harassment may be referred to the Office of Institutional Equity and Accessibility (OIEA).

Required Skill/Ability 2:

Expertise with Bash and at least one other scripting language. Demonstrated expertise with Linux system administration, including OS, networking, storage, and security.

Required Skill/Ability 4:

Excellent verbal and written communication skills. Ability to interact well with team members and end users. Ability to work independently and across units.

Required Skill/Ability 5:

Attention to detail with the proven ability to take the care necessary to be entrusted with a system that hundreds of users depend on for research computation and the storage of research data.