ML Ops Infrastructure Engineer with NVIDIA

Remote • Posted 8 hours ago • Updated 7 hours ago
Full Time
Part Time

Job Details

Skills

  • Management
  • Lifecycle Management
  • CUDA
  • MIG
  • Provisioning
  • RBAC
  • Computer Networking
  • Border Gateway Protocol
  • VLAN
  • Remote Direct Memory Access
  • InfiniBand
  • Load Balancing
  • Firewall
  • Policy Administration
  • Ceph
  • HPC
  • Performance Tuning
  • Capacity Management
  • Recovery
  • Machine Learning Operations (ML Ops)
  • NIST SP 800 Series
  • Auditing
  • Terraform
  • Ansible
  • Linux Administration
  • Red Hat Enterprise Linux
  • Linux
  • Ubuntu
  • Storage
  • Optimization
  • Service Management
  • Communication
  • Regulatory Compliance
  • Documentation
  • Air-Gapped Environments
  • USB
  • Media
  • Enterprise Software
  • Training
  • PyTorch
  • Log Management
  • Artificial Intelligence
  • Machine Learning (ML)
  • Kubernetes
  • Observability Stacks
  • Grafana
  • GPU
  • Dashboard
  • Security Clearance
  • SCADA
  • ICS
  • Network

Summary

Required Qualifications

  • 6+ years of infrastructure engineering experience, with at least 3 years managing GPU compute clusters or HPC environments in production.
  • Deep hands-on expertise with NVIDIA GPU infrastructure: driver lifecycle management, CUDA, DCGM, MIG, NVLink topologies, and the NVIDIA GPU Operator for Kubernetes.
  • Production-level Kubernetes administration experience on bare metal: cluster provisioning, upgrades, CNI/CSI configuration, RBAC, and day-2 operations.
  • Strong networking fundamentals: BGP, VLAN segmentation, RDMA/RoCE or InfiniBand configuration, load balancing, and firewall policy management.
  • Hands-on experience with software-defined storage (Ceph, Rook-Ceph, or MinIO) for AI/HPC workloads, including performance tuning, capacity planning, and failure recovery.
  • Practical MLOps experience: model serving infrastructure (Triton or equivalent), experiment tracking (MLflow or Kubeflow), and GitOps-based model deployment pipelines.
  • Working knowledge of NIST SP 800-171 controls and the ability to translate them into concrete infrastructure configurations and audit evidence.
  • Proficiency with infrastructure-as-code tooling: Terraform or Ansible for reproducible, auditable infrastructure builds.
  • Strong Linux systems administration skills (RHEL/Rocky Linux or Ubuntu) including kernel tuning, storage I/O optimization, and systemd service management.
  • Excellent written communication for producing infrastructure runbooks, network diagrams, and compliance documentation in a remote-first environment.

Nice to Have

  • Experience with air-gapped or classified network environments and the operational discipline they require (offline package mirrors, USB-controlled media transfers, etc.).
  • Familiarity with CMMC Level 2/3 assessment processes and evidence collection.
  • Experience with NVIDIA DGX Systems, BasePOD reference architectures, or the NVIDIA AI Enterprise software stack.
  • Knowledge of distributed training frameworks (PyTorch DDP, DeepSpeed, Megatron-LM) and their infrastructure requirements, which is useful for supporting AI/ML engineering teammates.
  • Experience deploying Kubernetes at the edge: K3s, MicroK8s, or NVIDIA Jetson-based edge clusters.
  • Familiarity with observability stacks: Prometheus, Grafana, Loki, OpenTelemetry, and DCGM Exporter for GPU telemetry dashboards.
  • US Person status or an active security clearance, which is advantageous for certain client site engagements.
  • Background in SCADA, ICS, or OT network environments relevant to critical infrastructure clients.

  • Dice Id: RTX1ce8c7
  • Position Id: OOJ - 1384-388-1777990450
