Job Details
Position Summary
We are looking for an AI Site Reliability Engineer to manage, optimize, and scale high-performance computing (HPC) and AI platforms, including NVIDIA DGX and Cisco UCS. This role blends SRE principles, AI/ML operationalization, and infrastructure automation for mission-critical environments.
Responsibilities
Manage & scale HPC platforms (NVIDIA DGX / Cisco UCS) for AI workloads.
Ensure availability, latency, scalability, and efficiency across systems.
Drive capacity planning, performance analysis, and instrumentation.
Automate infrastructure with Python, Ansible, Terraform, and Go.
Deliver capabilities via CI/CD pipelines and chatbots.
Maintain Service Level Objectives (SLOs).
Deploy and manage Enterprise Kubernetes clusters (OpenShift preferred).
Implement metrics-driven monitoring and system quality checks.
Mandatory Skills
Category | Skill |
---|---|
Programming | Python, GoLang, C/C++ |
Platforms | NVIDIA DGX, Cisco UCS |
Containers | Docker, Kubernetes, Red Hat OpenShift, Anthos |
Automation | Terraform, Ansible |
CI/CD | GitLab, GitHub Actions, Jenkins |
OS | Linux Administration (5+ years) |
Cloud/Infra | Hybrid Cloud, HPC systems |
Methodologies | Agile, DevOps, GitOps |
Experience | 5+ years Linux/SRE, 2+ years AI/HPC infra |
Preferred Skills
Certifications in Linux, Networking, Cloud.
HPC experience (Cray, HPE, IBM).
Virtualization & Container Orchestration.
Criteria | What Client Will Likely Accept | What They Will Reject |
---|---|---|
Work Authorization | GC-EAD, s (as per your JD) | OPT, CPT, TN Visa (for these roles, per your note) |
Experience Level | 5-8+ years relevant hands-on experience in cloud, AI/ML, or SRE | Entry-level or purely academic AI/ML experience |
Domain Expertise | Hybrid Cloud (AWS/Google Cloud Platform/OpenStack/Kubernetes), AI Ops, HPC (DGX/UCS) | Only application development without infra/ops exposure |
Technical Breadth | Proven hands-on with Python, GoLang, Terraform, Ansible, CI/CD, Kubernetes, ML frameworks (PyTorch/TensorFlow) | Candidates who have just one cloud, no automation tools, or only data science notebooks without deployments |
Mandatory Exposure | For AI SRE: HPC/AI infra (NVIDIA DGX, Cisco UCS), Linux Sysadmin (5+ years) | No HPC exposure or generic DevOps without AI workload handling |
Soft Skills | Strong collaboration, Agile/DevOps culture, cross-functional team work | Weak communication or no experience in large team environments |
Preferred Add-ons | Certifications (Cloud, Linux, Kubernetes), Cisco product familiarity | No certifications + no enterprise-scale work history |