Senior DevOps Lead Engineer (AI Acceleration) - Hybrid

Hybrid in Santa Clara, CA, US • Posted 3 days ago • Updated 1 hour ago

Contract W2

On-site

$80-110/hr

Job Details

Skills

Leadership
GitHub
Stacks Blockchain
PyTorch
TensorFlow
Parallel Computing
Testing
Configuration Management
Terraform
Amazon Web Services
Microsoft Azure
Google Cloud Platform
Google Cloud
Provisioning
Build Automation
RADIUS
Shared Services
Incident Management
Cloud Computing
Network
Collaboration
Documentation
Onboarding
API
Innovation
Research and Development
POC
Computer Science
Electrical Engineering
Linux
TCP/IP
Routing
Storage
Performance Tuning
Git
Continuous Integration
Continuous Delivery
Workflow
Caching
Bash
Scratch
Mentorship
Ansible
Docker
Management
Computer Networking
Kubernetes
Debugging
SAN
RBAC
Grafana
Instrumentation
Dashboard
Artificial Intelligence
Machine Learning (ML)
Computer Hardware
Startups
Solaris
IT Management
Python
DevOps
Microservices
Scripting
Nexus
GitLab
Job Scheduling
LSF
HPC
CPU
GPU
InfiniBand
Remote Direct Memory Access
Ethernet
Writing
Technical Writing
Blogging

Summary

The Role

You will be the senior DevOps technical lead on the Infrastructure team, owning the CI/CD pipelines, container infrastructure, observability stack, and shared tooling that AI/ML hardware accelerator development runs on in the lab, in the cloud, and across colocations at scale.

Because we design and manufacture AI acceleration silicon, a core part of this role is working with internal cloud and physical lab systems: automating and operating on-premises GPU clusters, high-speed interconnects, and lab server infrastructure, not just cloud resources. You will build the automation layer that ties lab hardware, cloud environments, and developer tooling into a single, reliable system.

You will also be instrumental in scaling that system globally as the company builds toward a follow-the-sun DevOps model across its expanding engineering sites.

What You Will Do
DevOps Leadership
Own CI/CD pipelines, runners, and execution environments across software, silicon, hardware, and ML teams: GitLab CI, GitHub Actions, and build systems like Bazel.
Build and maintain automated provisioning and deployment pipelines for GPU driver stacks, AI/ML frameworks (PyTorch, TensorFlow), and inference software; implement container-based test harnesses (Docker/Kubernetes/Singularity) that verify driver and framework compatibility across hardware generations (NVIDIA, AMD, Intel).
Improve pipeline performance through parallelization, caching, and architectural changes; maintain the Docker image library supporting AI/ML workload testing across distributions and framework versions.

Automation & Infrastructure as Code
Own IaC and configuration management (Terraform, Ansible, Python, Go, Bash) across lab, on-prem, colo, and cloud (AWS, Azure, Google Cloud Platform), covering everything from GPU/CPU driver provisioning to infrastructure deployments, with remote state management, environment isolation, and plan validation.
Build automation to eliminate toil and enforce consistency across team workflows; implement auto-remediation where appropriate with blast-radius controls and approval gates for production systems.
Operate and automate Kubernetes clusters and HPC container environments (Singularity/Apptainer) across cloud and on-premises: installation, upgrades, workload management, and troubleshooting.

Observability, Reliability & Incident Response
Design and maintain dashboards, alerting, and monitoring (Prometheus, Grafana, Datadog) across CI runners, lab hardware, GPU utilization, and shared services; define SLOs/SLIs and lead structured incident response when they are breached.
Lead incident triage from bare metal to application layer, resolving infrastructure, software, and hardware faults across CI/CD, lab, container, and cloud environments, including GPU driver failures, framework crashes, and network issues.

Documentation & Global Collaboration
Create and maintain high-quality documentation: architecture diagrams, troubleshooting guides, onboarding materials, and API/tool references.
Partner with Global DevOps and SRE team members to build a consistent, scalable operating model.
Serve as a technical resource across engineering teams: developing and sharing best practices, raising technical debt and reliability risks early, and always coming with a proposed plan.
Drive innovation by supporting R&D activities and leading proof-of-concept (POC) and proof-of-value (POV) evaluations for new tooling, infrastructure patterns, and accelerator technologies.

What You Will Bring
Required
Bachelor's or Master's in Computer Science, Electrical Engineering, or related field, with 10+ years of hands-on DevOps/infrastructure experience preferred (8 years minimum).
Deep Linux systems expertise: package management, networking (TCP/IP stack, routing, bonding), storage, systemd, kernel parameters, and performance tuning.
Production-grade Git-based CI/CD experience: pipeline design, runner management, merge request workflows, caching, and artifact handling.
Strong Python and/or Bash scripting for automation, with the ability to write clean, tested, maintainable code, not just one-off scripts.
Hands-on Ansible experience: writing playbooks from scratch for complex, multi-host configuration scenarios and mentoring team members on Ansible and IaC best practices.
Docker/container expertise: multi-stage builds, registry management, security scanning, and container networking.
Kubernetes operational experience: cluster lifecycle, workload debugging, storage, networking, and RBAC.
Prometheus + Grafana observability stack: metric instrumentation, alert design, and dashboard development.
Experience supporting AI/ML or HPC workloads on GPU or accelerator hardware, including driver installation, framework compatibility, and hardware-level troubleshooting.
Comfort operating in fast-moving startups: you ship, document, and iterate rather than wait for perfect requirements.
Cross-site or follow-the-sun DevOps technical leadership experience.

Strongly Preferred
Production Go and/or Python for DevOps services beyond scripting: pipeline validators, health-check microservices, or auto-remediation agents.
Experience with artifact repositories such as Harbor, Nexus, Artifactory, or GitLab Package Registry.
Job scheduling systems: Slurm, LSF, or similar HPC-style cluster job control.
Knowledge of CPU/GPU architectures and high-speed interconnect fabrics: InfiniBand, RoCE (RDMA over Converged Ethernet), or NVLink.
Prior experience speaking at technical conferences or writing public-facing technical documentation or blog posts.

Dice Id: cxbcsi
Position Id: Job44645