AI Site Reliability Engineer (Remote) - & GCEAD - Bridge Flair LLC

Overview

On Site

$40 - $50

Contract - W2

Contract - Independent

Contract - 12 Month(s)

Able to Provide Sponsorship

Skills

Agile

Amazon Web Services

Ansible

Application Development

Artificial Intelligence

C++

Capacity Management

Cisco

Cisco UCS

Cloud Computing

Collaboration

Communication

Computer Networking

Continuous Delivery

Continuous Integration

Cross-functional Team

Data Science

DevOps

GitHub

Golang

Good Clinical Practice

Google Cloud Platform

HPC

IBM

Instrumentation

Kubernetes

Linux

Machine Learning (ML)

Management

OpenStack

Orchestration

Performance Analysis

PyTorch

Python

Red Hat Linux

Scalability

Service Level

TensorFlow

Terraform

Virtualization

Job Details

Position Summary

We are looking for an AI Site Reliability Engineer to manage, optimize, and scale high-performance computing (HPC) and AI platforms, including NVIDIA DGX and Cisco UCS. This role blends SRE principles, AI/ML operationalization, and infrastructure automation for mission-critical environments.

Responsibilities

Manage and scale HPC platforms (NVIDIA DGX / Cisco UCS) for AI workloads.
Ensure availability, latency, scalability, and efficiency across systems.
Drive capacity planning, performance analysis, and instrumentation.
Automate infrastructure with Python, Ansible, Terraform, and Go.
Deliver capabilities via CI/CD pipelines and chatbots.
Maintain Service Level Objectives (SLOs).
Deploy and manage Enterprise Kubernetes clusters (OpenShift preferred).
Implement metrics-driven monitoring and system quality checks.

Mandatory Skills

Category	Skill
Programming	Python, GoLang, C/C++
Platforms	NVIDIA DGX, Cisco UCS
Containers	Docker, Kubernetes, Red Hat OpenShift, Anthos
Automation	Terraform, Ansible
CI/CD	GitLab, GitHub Actions, Jenkins
OS	Linux Administration (5+ years)
Cloud/Infra	Hybrid Cloud, HPC systems
Methodologies	Agile, DevOps, GitOps
Experience	5+ years Linux/SRE, 2+ years AI/HPC infra

Preferred Skills

Certifications in Linux, Networking, Cloud.
HPC experience (Cray, HPE, IBM).
Virtualization & Container Orchestration.

Criteria	What Client Will Likely Accept	What They Will Reject
Work Authorization	GC-EAD, s (as per your JD)	OPT, CPT, TN Visa (for these roles, per your note)
Experience Level	5-8+ years relevant hands-on experience in cloud, AI/ML, or SRE	Entry-level or purely academic AI/ML experience
Domain Expertise	Hybrid Cloud (AWS/Google Cloud Platform/OpenStack/Kubernetes), AI Ops, HPC (DGX/UCS)	Only application development without infra/ops exposure
Technical Breadth	Proven hands-on experience with Python, GoLang, Terraform, Ansible, CI/CD, Kubernetes, ML frameworks (PyTorch/TensorFlow)	Candidates with experience in only one cloud, no automation tools, or only data science notebooks without deployments
Mandatory Exposure	For AI SRE: HPC/AI infra (NVIDIA DGX, Cisco UCS), Linux Sysadmin (5+ years)	No HPC exposure or generic DevOps without AI workload handling
Soft Skills	Strong collaboration, Agile/DevOps culture, cross-functional teamwork	Weak communication or no experience in large team environments
Preferred Add-ons	Certifications (Cloud, Linux, Kubernetes), Cisco product familiarity	No certifications + no enterprise-scale work history

