AI Site Reliability Engineer - W2 (Remote)

Overview

Remote
$40 - $45
Full Time

Skills

Artificial Intelligence
Cisco UCS
Continuous Integration
High Performance Computing
IBM
HPC
Golang
Operational Excellence
Python
Terraform
Linux
Development Testing
Docker
Software Development

Job Details

Job Title: AI Site Reliability Engineer - W2 Only
Location: 100% Remote
Visa: EAD, L2 EAD, and TN

About the Role:

We are seeking an experienced AI Site Reliability Engineer to join our IT Infrastructure Services organization. You will build, develop, and scale our artificial intelligence platforms, which are powered by NVIDIA DGX and Cisco UCS technologies. Applying SRE best practices, you will ensure reliability, scalability, and operational excellence while automating deployments and optimizing performance in high-performance computing (HPC) environments.
Required:
  • 5+ years of experience administering and supporting Linux-based operating systems.
  • Hands-on experience deploying/administering NVIDIA DGX or equivalent HPC clusters (Cray, HPE, IBM).
  • Strong programming skills in Python, Golang, and C/C++, with experience in Git and CI/CD tools.
  • Experience with Kubernetes, Docker, Terraform, Ansible, GitOps, and Linux system administration.
  • Knowledge of the software development lifecycle, including design, development, testing, and deployment.