Google Cloud Platform (GCP) Supercomputer Solutions Support

Overview

Remote
Accepts corp to corp applications
Contract - W2
Contract - Independent
Contract - 7 day(s)

Skills

API
GCP
Application Programming
Google Cloud
Artificial Intelligence
HPC
HCS
Virtual Machine
VM

Job Details

Hello All,

Greetings from Edge Global LLC!

We are currently hiring for a Google Cloud Platform (GCP) Supercomputer Solutions Support position, and I wanted to reach out to see if you may be interested. Please find the job description below for your review.

If you are available, kindly share your updated resume so we can proceed further.

Also, if this role isn't the right fit for you, I'd truly appreciate it if you could forward this opportunity to your friends or colleagues who may be looking for a change.

Position: Google Cloud Platform (GCP) Supercomputer Solutions Support
Location: Remote
Duration: Long-Term

1. Project Overview

Google is seeking a supplier to provide engineering, maintenance, and enhancement services for its Google Cloud Platform ("GCP") Supercomputer Solutions. The supplier will be responsible for supporting and enhancing two key product areas: Cluster Toolkit and HyperCompute Cluster Service (HCS). This work involves a combination of ongoing operational tasks, testing, documentation, and specific development deliverables.

2. Scope of Work & Deliverables

The supplier will be responsible for the services and deliverables detailed below.

2.1. Ongoing Maintenance

  • The contractor must provide ongoing maintenance and enhancements for all 6 projects covered under the original Statement of Work.

2.2. Cluster Toolkit

Cluster Toolkit is an open-source software solution that simplifies the deployment of high-performance computing (HPC), artificial intelligence (AI), and machine learning (ML) workloads on Google Cloud.

Ongoing Responsibilities:

  • Stability Testing: Test the stability of new products, beginning with A3U. This includes:
    • Building NVIDIA Collective Communications Library (NCCL) tests on a Slurm cluster.
    • Setting up and running pairwise tests to identify and report bad nodes.
  • Integration Test Triage: Perform rotational duties to manage and triage integration test failures. This includes:
    • Monitoring daily failure chats and flake tools.
    • Reporting on failures and performing advanced handling, such as creating new bug reports and categorizations.
  • Documentation: Improve, organize, and maintain the Cluster Toolkit documentation. This process involves:
    • Gathering existing documents and identifying information gaps.
    • Creating new documentation and updating existing materials.
    • Organizing the information in g3docs, consolidating it in a team Google Drive, and establishing a review process.
  • Project Cleanup: Once a week, clean up the 'hpc-toolkit-dev' project by identifying and deleting unused resources.
  • Security: Triage and address security alerts by checking for them, creating pull requests (PRs) to resolve them, and applying the necessary updates.
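
For candidates unfamiliar with the pairwise testing mentioned under Stability Testing, the idea can be sketched roughly as follows. This is a minimal illustration, not the team's actual tooling: the function names are hypothetical, the pass/fail results are assumed to come from a two-node NCCL test run per pair (for example as a Slurm job, not shown here), and the rule "a node is suspect if every pair it took part in failed" is one simple heuristic among several possible ones.

```python
from collections import defaultdict


def round_robin_rounds(nodes):
    """Yield rounds of disjoint node pairs (circle method).

    Pairs within a round share no node, so they could run in parallel.
    Over all rounds, every node is paired with every other node once.
    """
    ring = list(nodes)
    if len(ring) % 2:
        ring.append(None)  # "bye" slot when the node count is odd
    n = len(ring)
    for _ in range(n - 1):
        pairs = [(ring[i], ring[n - 1 - i]) for i in range(n // 2)]
        yield [(a, b) for a, b in pairs if a is not None and b is not None]
        # Keep the first entry fixed and rotate the rest by one position.
        ring = [ring[0], ring[-1]] + ring[1:-1]


def suspect_nodes(results):
    """Return nodes whose every pairwise test failed.

    results maps (node_a, node_b) -> True if that pair's test passed.
    A node needs at least two tested pairs before it can be flagged.
    """
    total = defaultdict(int)
    failed = defaultdict(int)
    for (a, b), passed in results.items():
        for node in (a, b):
            total[node] += 1
            if not passed:
                failed[node] += 1
    return sorted(n for n in total if total[n] >= 2 and failed[n] == total[n])
```

With four nodes this produces three rounds of two pairs each. If every pair containing one particular node fails while all other pairs pass, only that node is reported, because each healthy node still has passing pairs.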

Key Deliverables:

  • HPC VM Image Releases: Deliver 4-6 High-Performance Computing Virtual Machine (HPC VM) image releases during 2025.
  • Software Widget Releases: Release new software widgets every two weeks during 2025, including managing any necessary hotfixes.

2.3. HyperCompute Cluster Service (HCS)

HCS is a service that enables the deployment and management of resilient, high-performance AI and HPC systems at scale.

Key Deliverables:

  • API Integration Testing: Add comprehensive integration tests for all HCS Application Programming Interface (API) surfaces. Coverage must include:
    • HypercomputeClusters: Create, Delete, Update, Get, and List requests and responses.
    • Network: NetworkInitialize params.
    • Storage: StorageInitialize, FileStoreInitialize, Filestore tier, ParallelstoreInitialize, and GcsInitialize params.
    • Compute: Resource request, Guest accelerator, Disk, Provisioning model, Reservation affinity and type, Orchestrator, Slurm, Node test, Storage configuration, and Slurm partition.
  • Critical User Journey (CUJ) Validation: Add integration tests to validate the following critical user journeys:
    • Creating a cluster that consumes a reservation.
    • Creating a cluster with a new network and new storage.
    • Creating a cluster using a pre-existing network and storage created both outside of HCS and by a previous HCS deployment.
    • Destroying all components of an HCS-created cluster.
    • Destroying a cluster while leaving the network and storage intact.
    • Updating a Slurm cluster to add a new reservation to both new and existing partitions.
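
As a rough illustration of what a user-journey integration test of this kind might look like, the sketch below walks a create, get, list, update, and delete sequence against an in-memory stand-in. Everything in it is an assumption made for illustration: FakeClusterService, its method names, and its parameters are hypothetical and do not represent the real HCS API, whose request and response shapes are not described in this posting.

```python
from dataclasses import dataclass, field


@dataclass
class FakeClusterService:
    """In-memory stand-in used only for illustration; not the real HCS API."""
    clusters: dict = field(default_factory=dict)
    networks: set = field(default_factory=set)
    storage: set = field(default_factory=set)

    def create(self, name, network, storage, reservation=None):
        if name in self.clusters:
            raise ValueError(f"cluster {name!r} already exists")
        # Reusing an existing network or storage name models the
        # "pre-existing resources" journey; new names model "create new".
        self.networks.add(network)
        self.storage.add(storage)
        self.clusters[name] = {
            "network": network,
            "storage": storage,
            "reservations": [reservation] if reservation else [],
        }
        return self.clusters[name]

    def get(self, name):
        return self.clusters[name]

    def list_names(self):
        return sorted(self.clusters)

    def add_reservation(self, name, reservation):
        self.clusters[name]["reservations"].append(reservation)
        return self.clusters[name]

    def delete(self, name, keep_network_and_storage=False):
        cluster = self.clusters.pop(name)
        if not keep_network_and_storage:
            self.networks.discard(cluster["network"])
            self.storage.discard(cluster["storage"])


def exercise_lifecycle(service):
    """Walk one create/get/list/update/delete journey and return the service."""
    service.create("c1", network="net-a", storage="fs-a", reservation="res-1")
    assert service.get("c1")["reservations"] == ["res-1"]
    assert service.list_names() == ["c1"]
    service.add_reservation("c1", "res-2")
    # Destroy the cluster but leave its network and storage intact.
    service.delete("c1", keep_network_and_storage=True)
    assert service.networks == {"net-a"} and service.storage == {"fs-a"}
    # A second cluster reuses resources left by the previous deployment.
    service.create("c2", network="net-a", storage="fs-a")
    # Full teardown removes the cluster, network, and storage.
    service.delete("c2")
    return service
```

A real test suite would replace the fake with calls to the deployed service and would add one case per journey and per API parameter listed above.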

Thanks and Regards,

Vivek

Edge Global

1604 Spring Hill Road, Suite 221, Vienna, VA 22182

An E Verified Company
