Job Details
Role: AI Site Reliability Engineer
Location: 100% Remote
Duration: Long Term
Your Role as an AI Site Reliability Engineer
We are building, developing, and expanding our artificial intelligence platforms, which will empower the business to fundamentally change the world. You will be an AI Site Reliability Engineer in the IT Infrastructure Services organization. You will use SRE mechanisms to reduce toil and maintain Service Level Objectives (SLOs) for our internal NVIDIA DGX and Cisco UCS-based AI platforms. You will lead, build, and run fully automated pipelines through our Continuous Integration/Continuous Delivery (CI/CD) system to deliver operational capabilities and improvements.
Responsibilities include
- Apply technical knowledge of high-performance computing, NVIDIA DGX/GPUs, and/or Cisco Unified Computing System (UCS).
- Handle availability, latency, scalability, and efficiency of NVIDIA and Cisco UCS infrastructure by building reliability engineering into the development life cycle, with a focus on fault-tolerant approaches.
- Drive capacity planning, performance analysis, instrumentation, and other non-functional systems requirements.
- Automate operational capabilities using Python, Ansible, Terraform, Go, etc.
- Deliver automation through CI/CD pipelines, chatbots, etc.
- Implement metrics-driven processes to ensure service quality targets are met.
Who You Are
You are an experienced Site Reliability Engineer for high-performance computing, artificial intelligence, machine learning, and/or integrated computer systems. You take a software engineering approach to solving operational problems. You know HPC and are familiar with Kubernetes. You have experience delivering software solutions and working with Linux operating systems. You understand IT infrastructure customers and are passionate about diving deep into problems and fixing them.
Our Minimum Requirements include:
- Bachelor's degree in Computer Science, Information Technology, or a related field; or equivalent years of experience in information technology.
- Experience deploying and administering NVIDIA (DGX) or equivalent high-performance-compute (HPC) clusters (e.g. Cray, HPE, IBM).
- 5+ years administering and supporting Linux-based operating systems.
- Experience writing code in general-purpose programming languages such as Python, Go, and C/C++, and using Git and CI/CD systems (e.g., GitLab, GitHub Actions, Jenkins).
- Experience deploying enterprise-grade Kubernetes clusters (Red Hat OpenShift preferred) and/or Google Anthos.
- Sophisticated knowledge of Kubernetes, Docker, Terraform, Ansible, Jenkins, GitOps, Git, and Linux.
- Experience across the software development lifecycle, including design, development, testing, packaging, and deployment, using Python or Go.
Preferred Qualifications
- Master's degree or equivalent experience in a relevant field.
- Certifications in Linux, Networking, Cloud, or related technologies.
- Prior successful experience as a compute or site/systems reliability engineer.
- Experience with Kubernetes, Hybrid Cloud, Virtualization, and Container technologies.
- Experience with Agile and DevOps operating models, including project tracking tools (e.g., Jira, Rally).
- Excellent collaborator who can partner, lead, guide, and communicate advanced technical concepts.