Senior System Engineer - DGX Cloud Lepton

Overview

Remote
On Site
USD 184,000.00 per year
Full Time

Skills

Cloud Computing
Research
Innovation
Patch Management
Firmware
Network
Artificial Intelligence
Change Control
ROOT
Mentorship
Hardening
Access Control
Continuous Integration
Continuous Delivery
Change Management
Communication
Documentation
Computer Science
Kubernetes
PSA
Supply Chain Management
Oracle Policy Automation
GPU
MIG
Recruiting
Promotions
SAP BASIS
Law

Job Details

Joining NVIDIA's DGX Cloud Lepton Team means contributing to the infrastructure that powers our innovative AI research. This team focuses on optimizing the efficiency and resiliency of AI workloads, as well as developing scalable AI and data infrastructure tools and services. Our objective is to deliver a stable, scalable environment for AI researchers, providing them with the necessary resources and scale to foster innovation. DGX Cloud Lepton delivers NVIDIA-managed GPU/Kubernetes capacity for AI workloads.

As a Senior System Engineer, you'll own the Lepton platform's reliability and ensure security is a first-class part of day-to-day operations. You'll have the autonomy to drive meaningful projects with strong mentorship and support. We practice blameless postmortems, iterate continuously, and encourage thoughtful risk-taking. If you're looking for an impactful, rewarding role, we invite you to apply.

What you'll be doing:
  • Platform fundamentals: design, build, and operate core services and node/cluster foundations for the Lepton platform; automate deployments, upgrades, and day-2 operations.
  • Vulnerability & patch management: own intake, prioritization, rollout, and rollback rhythms across OS, drivers/firmware, and platform components for the Lepton product.
  • Security as a product quality: define, deliver, and maintain secure-by-default baselines (host hardening, workload isolation, network segmentation, least-privilege access) for AI infrastructure at scale.
  • Identity & access stewardship: standardize patterns for service identity, role scoping, secrets handling, and certificate hygiene.
  • Trusted releases: drive change control and release practices that ensure traceability and integrity of what runs in production.
  • Monitoring & incident practice: establish health signals and SLOs; lead investigations, root-cause analysis, and follow-through actions that improve both reliability and security.
  • Risk & readiness: partner with product, SRE, and security stakeholders to assess risks for new features and close gaps with pragmatic controls.
  • Documentation & mentorship: publish runbooks and standards; review designs and coach engineers on secure operational practices.

What we need to see:
  • 7+ years in systems/platform engineering operating large-scale, production environments.
  • Demonstrated ability to deliver secure, reliable platforms (hardening, access control, isolation, monitoring, and strong operational runbooks).
  • Experience with containerized/managed cluster environments; familiarity with GPU-accelerated platforms or the ability to ramp quickly.
  • Automation mindset with infrastructure-as-code and CI/CD; disciplined change management.
  • Clear communication and documentation skills; ability to turn requirements into practical, supportable designs.
  • Bachelor's degree or higher in Computer Science or a related technical field (or equivalent experience).

Ways to stand out from the crowd:
  • Hands-on engineering experience delivering and driving platform security baselines in multi-tenant environments.
  • Production Kubernetes experience (EKS/AKS/GKE) at a fundamental level, especially private clusters and Pod Security Admission (PSA) restricted defaults.
  • Supply-chain basics at scale: signed images (cosign) enforced via policy-as-code (Kyverno/OPA).
  • Familiarity with NVIDIA GPU platforms (GPU Operator/device plugin, MIG-aware operations).

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 184,000 USD - 287,500 USD for Level 4, and 224,000 USD - 356,500 USD for Level 5.

You will also be eligible for equity and benefits.

Applications for this job will be accepted at least until August 19, 2025.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.