Senior System Engineer - DGX Cloud Lepton - NVIDIA Corporation

Overview

Remote

On Site

USD 184,000.00 per year

Full Time

Skills

Cloud Computing

Research

Innovation

Patch Management

Firmware

Network

Artificial Intelligence

Change Control

ROOT

Mentorship

Hardening

Access Control

Continuous Integration

Continuous Delivery

Change Management

Communication

Documentation

Computer Science

Kubernetes

PSA

Supply Chain Management

Oracle Policy Automation

GPU

MIG

Recruiting

Promotions

SAP BASIS

Law

Job Details

Joining NVIDIA's DGX Cloud Lepton Team means contributing to the infrastructure that powers our innovative AI research. This team focuses on optimizing the efficiency and resiliency of AI workloads, as well as developing scalable AI and data infrastructure tools and services. Our objective is to deliver a stable, scalable environment for AI researchers, providing them with the necessary resources and scale to foster innovation. DGX Cloud Lepton delivers NVIDIA-managed GPU/Kubernetes capacity for AI workloads.

As a Senior System Engineer, you'll own the Lepton platform's reliability and ensure security is a first-class part of day-to-day operations. You'll have the autonomy to drive meaningful projects with strong mentorship and support. We practice blameless postmortems, iterate continuously, and encourage thoughtful risk-taking. If you're looking for an impactful, rewarding role, we invite you to apply.

What you'll be doing:

Platform fundamentals: design, build, and operate core services and node/cluster foundations for the Lepton platform; automate deployments, upgrades, and day-2 operations.
Vulnerability & patch management: own intake, prioritization, rollout, and rollback rhythms across OS, drivers/firmware, and platform components for the Lepton product.
Security as a product quality: define, deliver, and maintain secure-by-default baselines (host hardening, workload isolation, network segmentation, least-privilege access) for AI infrastructure at scale.
Identity & access stewardship: standardize patterns for service identity, role scoping, secrets handling, and certificate hygiene.
Trusted releases: drive change control and release practices that ensure traceability and integrity of what runs in production.
Monitoring & incident practice: establish health signals and SLOs; lead investigations, root causes, and follow-through actions that improve both reliability and security.
Risk & readiness: partner with product, SRE, and security stakeholders to assess risks for new features and close gaps with pragmatic controls.
Documentation & mentorship: publish runbooks and standards; review designs and coach engineers on secure operational practices.

What we need to see:

7+ years in systems/platform engineering operating large-scale, production environments.
Demonstrated ability to deliver secure, reliable platforms (hardening, access control, isolation, monitoring, and strong operational runbooks).
Experience with containerized/managed cluster environments; familiarity with GPU-accelerated platforms or the ability to ramp quickly.
Automation mindset with infrastructure-as-code and CI/CD; disciplined change management.
Clear communication and documentation skills; ability to turn requirements into practical, supportable designs.
Bachelor's degree or higher in Computer Science or a related technical field (or equivalent experience).

Ways to stand out from the crowd:

Hands-on engineering experience delivering and driving platform security baselines in multi-tenant environments.
Production Kubernetes experience (EKS/AKS/GKE) at a fundamental level, especially private clusters and Pod Security Admission (PSA) restricted defaults.
Supply-chain basics at scale: signed images (cosign) enforced via policy-as-code (Kyverno/OPA).
Familiarity with NVIDIA GPU platforms (GPU Operator/device plugin, MIG-aware operations).

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 184,000 USD - 287,500 USD for Level 4, and 224,000 USD - 356,500 USD for Level 5.

You will also be eligible for equity and benefits.

Applications for this job will be accepted at least until August 19, 2025.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

