Site Reliability Engineer (AI Acceleration) - Hybrid

Hybrid in Santa Clara, CA, US • Posted 4 days ago • Updated 5 hours ago

Contract W2

On-site

$80-100/hr

Job Details

Skills

Software Development
Storage Management
Hardware Troubleshooting
Capacity Management
Reporting
Configuration Management
VLAN
Network
Workflow
Reliability Engineering
Software Deployment
QoS
Collaboration
Documentation
DevOps
Continuous Integration
Continuous Delivery
Computer Science
Electrical Engineering
System Administration
Linux
Computer Hardware
Provisioning
Artificial Intelligence
Terraform
Ansible
Kubernetes
Management
Storage
Computer Networking
Grafana
Dashboard
Writing
Python
Bash
Scripting
Incident Management
Root Cause Analysis
Startups
Customer Facing
IaaS
Amazon Web Services
Microsoft Azure
Google Cloud
Google Cloud Platform
Cloud Computing
HPC
LSF
InfiniBand
Lifecycle Management

Summary

The Role
You will be a core member of the SRE team, responsible for the reliability, automation, and observability of the infrastructure that the company runs on. You will work across colocation, on-premises lab environments, and cloud platforms, and you will own your systems end-to-end, from initial provisioning through live incident response.
You will partner with hardware and software development teams to support their workload needs, including the associated CI/CD pipelines and automation layer for software tooling. You will also support customer-facing environments where d-Matrix partners collaborate on hardware and software deployments.

What You Will Do

Infrastructure Operations
Own reliability and availability of assigned infrastructure domains: colo server fleets, on-premises lab clusters, cloud environments (AWS, Azure, Google Cloud Platform), and customer-facing platform services.
Perform hands-on infrastructure work: server provisioning, OS configuration, network setup, storage management, and hardware troubleshooting from bare metal up.
Conduct capacity planning and hardware lifecycle management for assigned infrastructure domains; track and report cloud spend for your domains to support FinOps and workload placement decisions.

Automation & Infrastructure as Code
Own IaC and configuration management (Terraform, Ansible) for your infrastructure domains: all provisioning and changes go through code, not manual steps.
Build, deploy and document automation to eliminate toil: host lifecycle management, fleet health checks, auto-remediation workflows, and self-service tooling for engineering teams.
Develop networking automations for cluster interconnects, VLAN management, and lab network configurations.
Contribute to shared IaC modules and automation libraries used across the global SRE and data center services teams.

Observability & Incident Response
Design and maintain monitoring dashboards, alerting, and SLIs (Prometheus/Grafana, DataDog) for your infrastructure domains, ensuring signal quality and actionable alerts, and contributing to AIOps-driven detection workflows that reduce time to detect and respond.
Participate in on-call rotation; triage and resolve incidents from bare metal to application layer using structured, AI-assisted workflows, distinguishing infrastructure faults from software or hardware product issues and escalating with clear context.
Produce high-quality RCA reports for P0/P1 incidents with root cause analysis and tracked action items.
Detect performance issues, recommend solutions, and implement fixes that permanently improve system reliability.

Customer & Platform Development Services
Support and operate platform services used by both internal teams and external customers for hardware and software deployment collaboration with d-Matrix.
Ensure QoS and uptime commitments for customer-facing environments; escalate reliability risks proactively.
Document platform configurations, access procedures, and operational runbooks for customer environments.

Documentation & Collaboration
Maintain high-quality runbooks, architecture diagrams, and troubleshooting guides; documentation is part of the job, not an afterthought.
Partner with the DevOps team to ensure infrastructure reliability supports CI/CD pipeline performance and developer experience.
Serve as a technical resource for engineering teams, sharing operational knowledge and raising infrastructure risks early.

What You Will Bring

Required
Bachelor's or Master's in Computer Science, Electrical Engineering, or related field (or equivalent experience); 5+ years in SRE, infrastructure engineering, or systems administration.
Strong Linux systems knowledge: networking, storage, systemd, package management, kernel parameters, and performance diagnostics.
Hands-on experience with colocation or on-premises server infrastructure: physical hardware, rack networking, and bare-metal provisioning.
Hands-on experience deploying and operating AI-driven infrastructure tools (AIOps platforms, intelligent alerting, anomaly detection, or LLM-assisted diagnostics) in production environments.
IaC experience with Terraform and/or Ansible: writing and maintaining production configurations, not just running existing playbooks.
Kubernetes operational experience: cluster troubleshooting, workload management, storage, and networking.
Prometheus + Grafana or DataDog: building dashboards, writing alert rules, and understanding signal quality.
Python and/or Bash scripting: production-quality automation, not just one-off scripts.
Incident response experience: structured triage, RCA production, and follow-through on action items.
Comfort operating in fast-moving startups: you own your systems, document what you build, and iterate without waiting for perfect requirements.

Strongly Preferred
Experience operating customer-facing infrastructure or platform services with external reliability expectations.
Cloud infrastructure operations across AWS, Azure, or Google Cloud Platform, including hybrid environments spanning cloud and on-prem.
HPC job scheduler experience (Slurm, LSF, or equivalent): operations and troubleshooting.
Knowledge of high-speed interconnect fabrics (InfiniBand, RoCE, or NVLink): configuration and troubleshooting.
Experience with large-scale infrastructure automation (host lifecycle management, fleet auto-healing, or AIOps-driven operations): building tooling that reduces manual intervention, not just running it.

Dice Id: cxbcsi
Position Id: Job44643