HPC/AI Linux Administrator (Scientist 2/3)

Overview

On Site
USD 101,700.00 - 168,200.00 per year
Full Time

Skills

High Performance Computing
Technical Writing
Verification And Validation
As-is Process
Art
System Administration
Linux Administration
Command-line Interface
Software Security
Scripting
Bash
Perl
Python
Configuration Management
Progress Chef
Puppet
Ansible
CFEngine
Technical Analysis
Communication
Collaboration
Publications
Presentations
Computer Hardware
Orchestration
Computer Networking
InfiniBand
Leadership
Project Planning
Delegation
HPC
Provisioning
Mentorship
Artificial Intelligence
Workflow
Machine Learning (ML)
Writing
Debugging
Kubernetes
Microservices
Cloud Computing
Effective Communication
Attention To Detail
Organizational Skills
Analytical Skill
Problem Solving
Conflict Resolution
Multitasking
Git
Continuous Integration
Continuous Delivery
Unix
Linux
Operating Systems
Splunk
Data Storage
ZFS
EXT
XFS
File Systems
Algorithms
RESTful
Interfaces
Storage
Ceph
Reporting
Management
Russian
Research
Science
PPO
Health Insurance
Network
Insurance
RADIUS
Design Of Experiments
Security Clearance
Authorization
Testing
Law
LOS
Recruiting

Job Details

Description

Job Title HPC/AI Linux Administrator (Scientist 2/3)

Location Los Alamos, NM, US

Organization Name HPC-OPS/High Performance Computing Operations Group

What You Will Do

Join the High Performance Computing Operations Group (HPC-OPS) in operating and maintaining some of the fastest supercomputers in the world. Designing, operating, and maintaining these systems requires highly skilled personnel who specialize in both the hardware and software aspects of High Performance Computing. Innovators at heart, HPC-OPS Linux Administrators work both independently and collaboratively to maintain and implement capability improvements across a complex computing environment. This team is currently building on-premises, cloud-like infrastructure to support the AI/ML/LLM needs of the Laboratory.

The Platforms Team is seeking highly knowledgeable and motivated team members to help build and deploy the AI/ML/LLM infrastructure for LANL. This person will be an expert Linux Administrator who will help design, build, and run our production NVIDIA DGX/HGX pods optimized for our environment and workflow. They will run and manage both admin and user-facing services with an understanding of modern AI/ML/LLM user workflows, Kubernetes, and other common tools. The successful candidate will participate in periodic on-call responsibilities managing HPC clusters and AI infrastructure, while actively growing their technical skills and staying up to date with the latest technologies in the field. In addition, the selected candidate will have the opportunity to develop technical products such as technical documentation, presentations, technical papers, and reports to communicate findings internally and at conferences.

The selected HPC/AI Linux Administrator (Scientist 2/3) will provide strategic design, testing, analysis, administration, configuration management, verification, and validation of the newly developed cloud-like infrastructure and specialized compute infrastructure for AI/ML workloads. Mentoring of students, junior staff, and peers in technical and professional growth activities is highly valued, as is maintaining state-of-the-art technical expertise and knowledge within HPC system administration and developing new skills in related disciplines. This is your chance to directly support our national security mission and continue to make LANL the best place to work as a member of a dynamic, team-oriented, and leading-edge technical capability team.

This position will be filled at either the Scientist 2 or 3 level, depending on the skills of the selected candidate. Additional job responsibilities (outlined below) will be assigned if the candidate is hired at the higher level.

What You Need

Minimum Job Requirements:

Scientist 2: ($101,700 - $168,200)
  • Advanced Linux Administration Expertise: Demonstrated knowledge of administering production Linux computer systems, including strong command line Linux operating system skills, working knowledge of or experience with hardware and software security practices, and experience scripting in Bash, Perl, Python, or similar languages.
  • Configuration Management Expertise: Demonstrated experience with configuration and automation tools and practices, such as Chef, Puppet, Ansible, Salt, CFEngine, or similar tools.
  • Troubleshooting and Technical Analysis Acumen: Significant knowledge and demonstrated experience in formulating and testing hypotheses, investigating alternative solutions, and recommending solutions to technical problems.
  • Computer Networking Expertise: Working knowledge of networking concepts and practices.
  • Communication and Teaming Skills: Demonstrated effective communication skills, both verbal and written, including the ability to communicate technical information to both technical and non-technical personnel, to provide assistance and knowledge to peers, to collaborate with Group members, other HPC Group personnel and vendor representatives, as required, and to formulate and communicate technical results and findings to technical audiences and readerships (examples can include publications, team projects, and presentations).
  • Troubleshooting skills: Demonstrated ability to troubleshoot hardware and software errors, prioritizing problems and assessing impact to stakeholders, documenting problems and solutions.

Additional Job Requirements for Scientist 3 ($122,300 - $206,300):

In addition to the Job Requirements outlined above, qualification at the Scientist 3 level requires:
  • Container Orchestration Expertise: Demonstrated experience managing, administering and maintaining large production Kubernetes clusters.
  • Troubleshooting Expertise: Experience troubleshooting and debugging user workflows in a Kubernetes environment.
  • Computer Networking Expertise: Experience with high performance interconnects, preferably NVLink and InfiniBand networks.
  • Leadership: Demonstrated experience with project planning and management. Ability to develop and lead complex projects, generate formal project plans, delegate tasks, and provide routine updates to management.
  • HPC Experience: Demonstrated experience building, installing, and administering HPC systems. Experience with modern image building and provisioning tools.
  • Mentoring: Ability to mentor and lead individual junior team members and students.

Education/Experience at Scientist 2:

The position requires a Bachelor's degree in a STEM field from an accredited college or university and 4 years of relevant experience or an equivalent combination of education and experience directly related to the occupation.

Education/Experience at Scientist 3:

The position requires a Master's degree in a STEM field from an accredited college or university and 6 years of relevant experience or an equivalent combination of education and experience directly related to the occupation.

Desired Qualifications:
  • Experience running NVIDIA DGX/HGX systems or pods in a production environment
  • Strong understanding of AI/ML workflows and experience setting up and maintaining user-facing AI/ML tools and services (such as JupyterHub).
  • Experience writing and debugging Kubernetes microservices in Go
  • Knowledge of Cloud technologies
  • Experience integrating operational metrics into a monitoring system such as Splunk
  • Demonstrated effective communication skills, including demonstrated ability to work productively with customers and vendors
  • High attention to detail, including excellent organizational, analytical, observational, and problem-solving skills. Proven ability to multitask independently and adapt to a dynamic, fast-paced environment.
  • Experience with Git, creating issues, branches, merge requests and using CI/CD pipelines
  • Experience modifying Unix/Linux operating systems (e.g., enabling/disabling kernel modules).
  • Practical experience with Splunk or other monitoring tools.
  • Knowledge of or demonstrated experience with parallel and distributed storage systems; knowledge of file systems such as ZFS, EXT, XFS; working knowledge of file system structures and algorithms; and/or experience with Object storage and RESTful storage interfaces. Experience administering cluster storage technologies such as Ceph.
  • Demonstrated ability to develop new methods, techniques, or approaches to address critical technical problems and/or develop new technical capabilities.
  • An active DOE Q Clearance

Work Location:

This position will be located in Los Alamos, NM, with the potential for a hybrid work arrangement (60% onsite/40% offsite) from a location within a 2-hour ground commute of Los Alamos. Reporting onsite will be required. Hybrid work is at the discretion of management and can change at any time with appropriate notice.

Position commitment: Regular appointment employees are required to serve a period of continuous service in their current position in order to be eligible to apply for posted jobs throughout the Laboratory. If an employee has not served the time required, they may only apply for Laboratory jobs with the documented approval of their Division Leader. The position commitment for this position is 1 year.

Note to Applicants:

For consideration, applicants should submit a cover letter addressing how their knowledge, skills and abilities meet the minimum requirements along with a resume.

Due to federal restrictions contained in the current National Defense Authorization Act, citizens of the People's Republic of China, including the special administrative regions of Hong Kong and Macau, as well as citizens of the Islamic Republic of Iran, the Democratic People's Republic of Korea (North Korea), and the Russian Federation, who are not Lawful Permanent Residents are prohibited from accessing facilities that support the mission, functions, and operations of national security laboratories and nuclear weapons production facilities, which includes Los Alamos National Laboratory.

Where You Will Work

Located in beautiful northern New Mexico, Los Alamos National Laboratory (LANL) is a multidisciplinary research institution engaged in strategic science on behalf of national security. Our generous benefits package includes:

  • PPO or High Deductible medical insurance with the same large nationwide network
  • Dental and vision insurance
  • Free basic life and disability insurance
  • Paid childbirth and parental leave
  • Award-winning 401(k) (6% matching plus 3.5% annually)
  • Learning opportunities and tuition assistance
  • Flexible schedules and time off (PTO and holidays)
  • Onsite gyms and wellness programs
  • Extensive relocation packages (outside a 50 mile radius)

Additional Details

Directive 206.2 - Employment with Triad requires a favorable decision by NNSA indicating the employee is suitable under NNSA Supplemental Directive 206.2. Please note that this requirement applies only to citizens of the United States. Foreign nationals are subject to a similar requirement under DOE Order 142.3A.

Clearance: Q (Position will be cleared to this level). Selected applicants will be subject to a background investigation conducted by or on behalf of the Federal Government, and must meet eligibility requirements for access to classified matter. This position requires a Q clearance, and obtaining such clearance requires U.S. citizenship except in extremely rare circumstances. Dependent upon the position, additional authorization to access classified information may be required, which may or may not be available to dual citizens. Receipt of a Q clearance and additional access authorization ultimately is a decision of the Federal Government and not of Triad.

New-Employment Drug Test: The Laboratory requires successful applicants to complete a new-employment drug test and maintains a substance abuse policy that includes random drug testing. Although New Mexico and other states have legalized the use of marijuana, use and possession of marijuana remain illegal under federal law. A positive drug test for marijuana will result in termination of employment, even if the use was pre-offer.

Regular position: Term status Laboratory employees applying for regular-status positions are converted to regular status.

Internal Applicants: Regular appointment employees who have served the required period of continuous service in their current position are eligible to apply for posted jobs throughout the Laboratory. If an employee has not served the required period of continuous service, they may only apply for Laboratory jobs with the documented approval of their Division Leader. Please refer to Policy P701 for applicant eligibility requirements.

Equal Opportunity: Los Alamos National Laboratory is an equal opportunity employer. All employment practices are based on qualification and merit, without regard to protected categories such as race, color, national origin, ancestry, religion, age, sex, gender identity, sexual orientation, marital status or spousal affiliation, physical or mental disability, medical conditions, pregnancy, status as a protected veteran, genetic information, or citizenship within the limits imposed by federal, state, and local laws and regulations. The Laboratory is also committed to making our workplace accessible to individuals with disabilities and will provide reasonable accommodations, upon request, for individuals to participate in the application and hiring process. To request such an accommodation, please send an email to or call opt. 3.

Employment Status Full Time

Appointment Type Regular

Contact Details

Contact Name Wroblewski, Alex Christopher
