Site Reliability Engineer

Overview

On Site

USD 50.00 - 60.00 per hour

Full Time

Skills

Service Level

Continuous Integration and Development

Cisco

Scalability

Cisco UCS

FOCUS

Capacity Management

Performance Analysis

Instrumentation

Artificial Intelligence

Machine Learning (ML)

Software Engineering

IT Infrastructure

Science

Information Technology

HPC

IBM

Operating Systems

Writing

Programming Languages

C++

Continuous Integration

Continuous Delivery

GitLab

GitHub

Red Hat Linux

Terraform

Git

Software Development

Development Testing

Golang

Computer Networking

Reliability Engineering

Virtualization

Agile

JIRA

Rally

Linux

Kubernetes

Jenkins

Ansible

DevOps

Python

Cloud Computing

Taxes

Life Insurance

Partnership

Collaboration

Business Transformation

Law

Job Details

Top Skills' Details
Bachelor's degree in Computer Science, Information Technology, or a related field; or equivalent years of experience in information technology.
Experience deploying and administering NVIDIA (DGX) or equivalent high-performance compute (HPC) clusters (e.g., Cray, HPE, IBM).
5+ years administering and supporting Linux-based operating systems.
Experience writing code in general-purpose programming languages such as Python, Golang, and C/C++, and using Git and CI/CD systems (e.g., GitLab, GitHub Actions, Jenkins).
Experience deploying enterprise-grade Kubernetes clusters (Red Hat OpenShift preferred) and/or Google Anthos.
Sophisticated knowledge of Kubernetes, Docker, Terraform, Ansible, Jenkins, GitOps, Git, and Linux.
Experience across the software development lifecycle, including design, development, testing, packaging, and deployment, using Python or Golang.
Description
Your Role as an IT AI Site Reliability Engineer
This group is building, developing, and expanding our artificial intelligence platforms, which will empower the business to fundamentally change the world. You will be an AI Site Reliability Engineer in the IT Infrastructure Services organization. You will use SRE mechanisms to reduce toil and maintain Service Level Objectives (SLOs) for our internal NVIDIA DGX and Cisco UCS-based AI platforms. You will lead, build, and run fully automated pipelines through our Continuous Integration/Continuous Delivery (CI/CD) system to deliver operational capabilities and improvements.
Responsibilities include
Technical knowledge of high-performance compute, NVIDIA DGX/GPUs and/or Cisco Unified Compute System.
Handle availability, latency, scalability, and efficiency of NVIDIA and Cisco UCS infrastructure by building reliability engineering into the development life cycle, with a focus on fault-tolerant approaches.
Drive capacity planning, performance analysis, instrumentation, and other non-functional systems requirements.
Automate operational capabilities using Python, Ansible, Terraform, Go, etc.
Deliver automation through CI/CD pipelines, chatbots, and similar channels.
Implement metrics-driven processes to ensure service quality targets are met.
Who You Are
You are an experienced Site Reliability Engineer for high-performance compute, artificial intelligence, machine learning, and/or integrated computer systems. You take a software engineering approach to solving operational problems. You know HPC and are familiar with Kubernetes. You have experience delivering software solutions and working with Linux operating systems. You understand IT infrastructure customers and are passionate about diving deep into problems and fixing them.
Our Minimum Requirements include:
Bachelor's degree in Computer Science, Information Technology, or a related field; or equivalent years of experience in information technology.
Experience deploying and administering NVIDIA (DGX) or equivalent high-performance compute (HPC) clusters (e.g., Cray, HPE, IBM).
5+ years administering and supporting Linux-based operating systems.
Experience writing code in general-purpose programming languages such as Python, Golang, and C/C++, and using Git and CI/CD systems (e.g., GitLab, GitHub Actions, Jenkins).
Experience deploying enterprise-grade Kubernetes clusters (Red Hat OpenShift preferred) and/or Google Anthos.
Sophisticated knowledge of Kubernetes, Docker, Terraform, Ansible, Jenkins, GitOps, Git, and Linux.
Experience across the software development lifecycle, including design, development, testing, packaging, and deployment, using Python or Golang.
Preferred Qualifications
Master's degree or equivalent experience in relevant field.
Certifications in Linux, Networking, Cloud, or related technologies.
Prior successful experience as a compute or site/systems reliability engineer.
Experience with Kubernetes, Hybrid Cloud, Virtualization, and Container technologies.
Experience with Agile and DevOps operating models, including project tracking tools (e.g., Jira, Rally).
Excellent collaborator who can partner, lead, guide, and communicate advanced technical concepts.
Skills
DevOps, Python, Cloud, Linux, Kubernetes, Jenkins, Ansible
Top Skills Details
DevOps, Python, Cloud
Experience Level
Expert Level
Pay and Benefits
The pay range for this position is $50.00 - $60.00/hr.
Eligibility requirements apply to some benefits and may depend on your job
classification and length of employment. Benefits are subject to change and may be
subject to specific elections, plan, or program terms. If eligible, the benefits
available for this temporary role may include the following:
Medical, dental & vision
Critical Illness, Accident, and Hospital
401(k) Retirement Plan - Pre-tax and Roth post-tax contributions available
Life Insurance (Voluntary Life & AD&D for the employee and dependents)
Short and long-term disability
Health Spending Account (HSA)
Transportation benefits
Employee Assistance Program
Time Off/Leave (PTO, Vacation or Sick Leave)
Workplace Type
This is a hybrid position in Morrisville, NC.
Application Deadline
This position is anticipated to close on Aug 22, 2025.
About TEKsystems:
We're partners in transformation. We help clients activate ideas and solutions to take advantage of a new world of opportunity. We are a team of 80,000 strong, working with over 6,000 clients, including 80% of the Fortune 500, across North America, Europe and Asia. As an industry leader in Full-Stack Technology Services, Talent Services, and real-world application, we work with progressive leaders to drive change. That's the power of true partnership. TEKsystems is an Allegis Group company.

The company is an equal opportunity employer and will consider all applications without regard to race, sex, age, color, religion, national origin, veteran status, disability, sexual orientation, gender identity, genetic information or any characteristic protected by law.

About TEKsystems and TEKsystems Global Services

We're a leading provider of business and technology services. We accelerate business transformation for our customers. Our expertise in strategy, design, execution and operations unlocks business value through a range of solutions. We're a team of 80,000 strong, working with over 6,000 customers, including 80% of the Fortune 500 across North America, Europe and Asia, who partner with us for our scale, full-stack capabilities and speed. We're strategic thinkers, hands-on collaborators, helping customers capitalize on change and master the momentum of technology. We're building tomorrow by delivering business outcomes and making positive impacts in our global communities. TEKsystems and TEKsystems Global Services are Allegis Group companies. Learn more at TEKsystems.com.

Employers have access to artificial intelligence language tools (“AI”) that help generate and enhance job descriptions and AI may have been used to create this description. The position description has been reviewed for accuracy and Dice believes it to correctly reflect the job opportunity.
