Dear Applicant,

We are excited to share an excellent opportunity with one of our leading clients for a Google Cloud Platform Supercomputer Solutions Support Engineer (Remote Contract). If this role matches your expertise, please apply with your most up-to-date resume for consideration.
Job Title: Google Cloud Platform Supercomputer Solutions Support
Location: Remote Role
Duration: Contract Position
1. Scope of Work & Deliverables
The supplier will be responsible for the services and deliverables detailed below:
1.1. Ongoing Maintenance: The contractor must provide ongoing maintenance and enhancements for all 6 projects covered under the original Statement of Work.
1.2. Cluster Toolkit: Cluster Toolkit is an open-source software solution that simplifies the deployment of high-performance computing (HPC), artificial intelligence (AI), and machine learning (ML) workloads on Google Cloud.
Ongoing Responsibilities:
- Stability Testing: Test the stability of new products, beginning with A3U. This includes:
  - Building NVIDIA Collective Communications Library (NCCL) tests on a Slurm cluster.
  - Setting up and running pairwise tests to identify and report bad nodes.
- Integration Test Triage: Perform rotational duties to manage and triage integration test failures. This includes:
  - Monitoring daily failure chats and flake tools.
  - Reporting on failures and performing advanced handling, such as filing new bug reports and categorizing failures.
- Documentation: Improve, organize, and maintain the Cluster Toolkit documentation. This process involves:
  - Gathering existing documents and identifying information gaps.
  - Creating new documentation and updating existing materials.
  - Organizing the information in g3docs, consolidating it in a team Google Drive, and establishing a review process.
- Project Cleanup: Once a week, clean up the 'hpc-toolkit-dev' project by identifying and deleting unused resources.
- Security: Triage and address security alerts by checking for them, creating pull requests (PRs) to resolve them, and applying the necessary updates.
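For candidates unfamiliar with the pairwise testing mentioned under Stability Testing above, the idea can be sketched roughly as follows. This is an illustrative sketch only, not the client's tooling: the function names, the node names, and the failure threshold are all assumptions, and the step that actually launches an NCCL test on each pair is left out. Each round pairs every node with a different partner, and a node that shows up in several failed pairs is flagged as a likely bad node.

```python
from collections import Counter


def round_robin_rounds(nodes):
    """Yield rounds of disjoint node pairs (circle method), covering every pair once."""
    ring = list(nodes)
    if len(ring) % 2:
        ring.append(None)  # placeholder: the node paired with None sits out that round
    n = len(ring)
    for _ in range(n - 1):
        pairs = [(ring[i], ring[n - 1 - i]) for i in range(n // 2)]
        yield [p for p in pairs if None not in p]
        # Keep the first entry fixed and rotate the rest by one position.
        ring = [ring[0]] + [ring[-1]] + ring[1:-1]


def suspect_nodes(failed_pairs, threshold=2):
    """Return nodes that appear in at least `threshold` failed pairs (threshold is an assumption)."""
    counts = Counter(node for pair in failed_pairs for node in pair)
    return sorted(node for node, count in counts.items() if count >= threshold)


# Hypothetical node names, for illustration only.
nodes = ["n0", "n1", "n2", "n3"]
rounds = list(round_robin_rounds(nodes))

# Hypothetical results: every pair that included "n2" failed its test.
failed = [("n0", "n2"), ("n1", "n2"), ("n2", "n3")]
bad = suspect_nodes(failed)
```

Because the pairs within a round are disjoint, the tests in one round can run at the same time, which keeps the total number of rounds to roughly the number of nodes.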
Key Deliverables:
- HPC VM Image Releases: Deliver 4-6 High-Performance Computing Virtual Machine (HPC VM) image releases during 2025.
- Software Widget Releases: Release new software widgets every two weeks during 2025, including managing any necessary hotfixes.
1.3. HyperCompute Cluster Service (HCS): HCS is a service that enables the deployment and management of resilient, high-performance AI and HPC systems at scale.
Key Deliverables:
- API Integration Testing: Add comprehensive integration tests for all HCS Application Programming Interface (API) surfaces. Coverage must include:
  - HypercomputeClusters: Create, delete, update, get, and list requests and responses.
  - Network: NetworkInitialize params.
  - Storage: StorageInitialize, FileStoreInitialize, Filestore tier, ParallelstoreInitialize, and GcsInitialize params.
  - Compute: resource request, guest accelerator, disk, provisioning model, reservation affinity and type, orchestrator, Slurm, node test, storage configuration, and Slurm partition.
- Critical User Journey (CUJ) Validation: Add integration tests to validate the following critical user journeys:
  - Creating a cluster that consumes a reservation.
  - Creating a cluster with a new network and new storage.
  - Creating a cluster using a pre-existing network and storage, covering both resources created outside of HCS and resources created by a previous HCS deployment.
  - Destroying all components of an HCS-created cluster.
  - Destroying a cluster while leaving the network and storage intact.
  - Updating a Slurm cluster to add a new reservation to both new and existing partitions.
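To give a sense of what a critical user journey test looks like in practice, here is a minimal sketch covering three of the journeys listed above. It runs against a hypothetical in-memory stand-in (`FakeCloud`) rather than the real HCS API; every class, method, and resource name in it is an assumption made for illustration, and real integration tests would call the actual service.

```python
from dataclasses import dataclass, field


@dataclass
class FakeCloud:
    """Hypothetical in-memory stand-in for the service; this is not the HCS API."""
    clusters: dict = field(default_factory=dict)
    networks: set = field(default_factory=set)
    storage: set = field(default_factory=set)

    def create_cluster(self, name, network, storage, reuse_existing=False):
        if not reuse_existing:
            # Journey: a new network and new storage are created with the cluster.
            self.networks.add(network)
            self.storage.add(storage)
        elif network not in self.networks or storage not in self.storage:
            raise ValueError("pre-existing network or storage not found")
        self.clusters[name] = {"network": network, "storage": storage}

    def delete_cluster(self, name, keep_network_and_storage=False):
        cluster = self.clusters.pop(name)
        if not keep_network_and_storage:
            self.networks.discard(cluster["network"])
            self.storage.discard(cluster["storage"])


def check_destroy_all(cloud):
    cloud.create_cluster("c1", "net-a", "fs-a")
    cloud.delete_cluster("c1")
    return not cloud.clusters and not cloud.networks and not cloud.storage


def check_destroy_keeps_network_and_storage(cloud):
    cloud.create_cluster("c1", "net-a", "fs-a")
    cloud.delete_cluster("c1", keep_network_and_storage=True)
    return "c1" not in cloud.clusters and "net-a" in cloud.networks and "fs-a" in cloud.storage


def check_reuse_from_previous_deployment(cloud):
    cloud.create_cluster("c1", "net-a", "fs-a")
    cloud.delete_cluster("c1", keep_network_and_storage=True)
    cloud.create_cluster("c2", "net-a", "fs-a", reuse_existing=True)
    return cloud.clusters["c2"] == {"network": "net-a", "storage": "fs-a"}


CHECKS = [
    check_destroy_all,
    check_destroy_keeps_network_and_storage,
    check_reuse_from_previous_deployment,
]
# Each journey runs against a fresh environment so the checks stay independent.
results = {check.__name__: check(FakeCloud()) for check in CHECKS}
```

Keeping each journey as its own small check, run against a clean environment, makes a failure point directly at the journey that broke.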