Platform Engineer

San Jose, CA, US • Posted 4 days ago • Updated 2 days ago

Contract Independent

Contract W2

12 Months

No Travel Required

On-site

Depends on Experience

Job Details

Skills

HPC
Kubernetes
Grafana
Terraform

Summary

Job Title: Platform Engineer

Location: San Jose, CA

Role Type: 12+ Months (Contract)

Job Description:

We are seeking an AI Infrastructure / Platform Engineer to join our team building and operating large-scale GPU compute infrastructure that powers AI and ML workloads. The ideal candidate is passionate about software engineering and has the leadership skills to deliver multiple projects independently. They should communicate effectively and work well with peers across our larger organization.

The Person:

Experience in platform, infrastructure, or DevOps engineering.
Deep hands-on experience with Kubernetes and container orchestration at scale.
Proven ability to design and deliver platform features that serve internal customers or developer teams.
Experience building developer-facing platforms or internal developer portals (e.g., custom workflow tooling).

Key Responsibilities:

Build and extend platform capabilities to enable different classes of workloads (e.g., large-scale AI training and inference).
Design and operate scalable orchestration systems using Kubernetes across both on-prem and multi-cloud environments.
Develop platform features such as pre-flight health checks, job status monitoring and post-mortem analysis.
Partner with development teams to extend the GPU developer platform with features, APIs, templates, and self-service workflows that streamline job orchestration and environment management.
Apply expertise in storage and networking to design and integrate CSI drivers, persistent volumes, and network policies that enable high-performance GPU workloads.
Provide production support for large-scale GPU clusters.

Preferred Experience:

Hands-on experience in storage or network engineering within Kubernetes environments (e.g., CSI drivers, dynamic provisioning, CNI plugins, or network policy).
Experience with Infrastructure as Code tools like Terraform.
Background in HPC, Slurm, or GPU-based compute systems for ML/AI workloads.
Practical experience with monitoring and observability tools (Prometheus, Grafana, Loki, etc.).
Understanding of machine learning frameworks (PyTorch, vLLM, SGLang, etc.).
Experience with high-performance networking and InfiniBand (IB)/RDMA tuning.

Academic Credentials:

Bachelor’s or master’s degree in computer science, computer engineering, electrical engineering, or equivalent.

Dice Id: saicon
Position Id: 8968860
