AI Site Reliability Engineer (Remote) - & GCEAD

Overview

On Site
$40 - $50
Contract - W2
Contract - Independent
Contract - 12 Month(s)
Able to Provide Sponsorship

Skills

Agile
Amazon Web Services
Ansible
Application Development
Artificial Intelligence
C
C++
Capacity Management
Cisco
Cisco UCS
Cloud Computing
Collaboration
Communication
Computer Networking
Continuous Delivery
Continuous Integration
Cross-functional Team
Data Science
DevOps
GC
GitHub
Golang
Good Clinical Practice
Google Cloud Platform
HPC
IBM
Instrumentation
JD
Kubernetes
Linux
Machine Learning (ML)
Management
OpenStack
Orchestration
Performance Analysis
PyTorch
Python
Red Hat Linux
Scalability
Service Level
TensorFlow
Terraform
Virtualization

Job Details


Position Summary

We are looking for an AI Site Reliability Engineer to manage, optimize, and scale high-performance computing (HPC) and AI platforms, including NVIDIA DGX and Cisco UCS. This role blends SRE principles, AI/ML operationalization, and infrastructure automation for mission-critical environments.


Responsibilities

  • Manage & scale HPC platforms (NVIDIA DGX / Cisco UCS) for AI workloads.

  • Ensure availability, latency, scalability, and efficiency across systems.

  • Drive capacity planning, performance analysis, and instrumentation.

  • Automate infrastructure with Python, Ansible, Terraform, and Go.

  • Deliver capabilities via CI/CD pipelines and chatbots.

  • Maintain Service Level Objectives (SLOs).

  • Deploy and manage Enterprise Kubernetes clusters (OpenShift preferred).

  • Implement metrics-driven monitoring and system quality checks.


Mandatory Skills

Category | Skill
Programming | Python, GoLang, C/C++
Platforms | NVIDIA DGX, Cisco UCS
Containers | Docker, Kubernetes, RedHat OpenShift, Anthos
Automation | Terraform, Ansible
CI/CD | GitLab, GitHub Actions, Jenkins
OS | Linux Administration (5+ years)
Cloud/Infra | Hybrid Cloud, HPC systems
Methodologies | Agile, DevOps, GitOps
Experience | 5+ years Linux/SRE, 2+ years AI/HPC infra

Preferred Skills

  • Certifications in Linux, Networking, Cloud.

  • HPC experience (Cray, HPE, IBM).

  • Virtualization & Container Orchestration.

Criteria | What Client Will Likely Accept | What They Will Reject
Work Authorization | GC-EAD, s (as per your JD) | OPT, CPT, TN Visa (for these roles, per your note)
Experience Level | 5-8+ years relevant hands-on experience in cloud, AI/ML, or SRE | Entry-level or purely academic AI/ML experience
Domain Expertise | Hybrid Cloud (AWS/Google Cloud Platform/OpenStack/Kubernetes), AI Ops, HPC (DGX/UCS) | Only application development without infra/ops exposure
Technical Breadth | Proven hands-on with Python, GoLang, Terraform, Ansible, CI/CD, Kubernetes, ML frameworks (PyTorch/TensorFlow) | Candidates who have just one cloud, no automation tools, or only data science notebooks without deployments
Mandatory Exposure | For AI SRE: HPC/AI infra (NVIDIA DGX, Cisco UCS), Linux Sysadmin (5+ years) | No HPC exposure or generic DevOps without AI workload handling
Soft Skills | Strong collaboration, Agile/DevOps culture, cross-functional team work | Weak communication or no experience in large team environments
Preferred Add-ons | Certifications (Cloud, Linux, Kubernetes), Cisco product familiarity | No certifications + no enterprise-scale work history

About Bridge Flair LLC