Site Reliability Engineer Manager - Hybrid

Hybrid in Santa Clara, CA, US • Posted 4 days ago • Updated 6 hours ago

Contract W2

On-site

$90-120/hr

Job Details

Skills

Reliability Engineering
Technical Direction
DevOps
Team Leadership
Operational Excellence
Continuous Improvement
Software Development
Change Control
Problem Management
Artificial Intelligence
Analytics
Root Cause Analysis
Capacity Management
Provisioning
Collaboration
Documentation
Solaris
POC
Roadmaps
Computer Science
Electrical Engineering
Production Engineering
Recruiting
Linux
TCP/IP
Remote Direct Memory Access
Computer Hardware
Computer Networking
Fluency
Terraform
Ansible
Kubernetes
Lifecycle Management
Storage
RBAC
Grafana
Python
Scripting
Workflow
Orchestration
Communication
Operational Risk
Leadership
Management
Customer Facing
InfiniBand
Performance Tuning
LSF
Amazon Web Services
Microsoft Azure
Google Cloud
Google Cloud Platform
Modeling
Cloud Computing
ITIL
Change Management
Technical Writing
Open Source
HPC

Summary

The Role
You will build and lead the Site Reliability Engineering team, owning the infrastructure that development, validation, and customer-facing deployments run on. This spans colocation facilities, on-premises lab clusters, cloud environments (AWS, Azure, Google Cloud Platform), and the platform services customers use to collaborate on hardware and software deployments.

You are both a people manager and a practicing engineer. You will set technical direction, hire and grow the team, own SLOs for critical systems, and be the senior escalation point when things go wrong. You will work closely with hardware and software development teams to ensure HPC infrastructure meets their workload requirements and partner with the Senior DevOps Lead whose pipelines and automation run on the infrastructure you own.

What You Will Do
Team Leadership & Strategy
Develop and manage a team of 3 to 5 SRE engineers; establish a culture of operational excellence, ownership, and continuous improvement.
Define the SRE team's technical roadmap: reliability architecture, automation priorities, capacity planning, and on-call model.
Serve as the senior technical escalation point for critical incidents: guiding cross-team triage, driving RCA, and ensuring systemic fixes rather than point patches.
Translate operational signals and infrastructure health into clear, actionable narratives for engineering leadership and executive stakeholders.
Partner with hardware and software development teams to understand HPC workload requirements and ensure infrastructure capacity, performance, and reliability meet the needs of silicon and software development programs.

24x7 Infrastructure Reliability & Observability
Own 24x7 reliability across colocation, on-premises lab clusters, cloud, and customer-facing platform services, designing for failure domains, progressive delivery, and strict change control at every tier.
Own the full observability stack (metrics, traces, logs) and define SLOs/SLIs across all SRE systems; use AI-driven detection, correlation, and guided remediation to reduce time to detect, respond, and resolve.
Evolve incident and problem management into a data-driven discipline: automated triage workflows, AI/analytics to identify recurring patterns, and every P0/P1 producing a written RCA with tracked systemic fixes.
Lead FinOps and capacity planning: model TCO across cloud vs. on-prem vs. colo, drive workload placement decisions, and anticipate infrastructure needs for new silicon programs and customer deployments.
Own infrastructure for customer collaboration environments where partners deploy and validate hardware and software.

Automation & Infrastructure as Code
Drive IaC-first discipline across the team: Terraform, Ansible, and production-quality automation for all infrastructure provisioning and lifecycle management.
Build and mature self-healing infrastructure platforms: host lifecycle automation, fleet auto-remediation, and AIOps-driven alerting that reduce manual intervention across the operational lifecycle.

Documentation & Global Collaboration
Build a documentation culture and scale a follow-the-sun on-call model as we expand globally: runbooks, architecture diagrams, and operational playbooks maintained as living artifacts.
Drive POC and POV evaluations for new infrastructure technologies, interconnect fabrics, and platform services relevant to our accelerator roadmap.

What You Will Bring
Required
Bachelor's or Master's in Computer Science, Electrical Engineering, or related field; 12+ years in SRE, infrastructure engineering, or production engineering (8 years minimum).
3+ years managing SRE or infrastructure teams: hiring, growing, and retaining engineers in a fast-moving environment.
Deep Linux systems expertise: networking (TCP/IP, RDMA, bonding), storage, kernel tuning, and bare-metal operations.
Proven experience operating colocation and on-premises hardware at scale: server lifecycle, power and cooling awareness, rack-level networking.
IaC fluency: Terraform and Ansible at production scale, including module design, remote state, environment isolation, and change governance.
Kubernetes cluster operations: lifecycle management, workload reliability, storage, and RBAC at scale.
Full observability stack ownership: Prometheus, Grafana, and/or Datadog, covering SLO definition, alert design, and E2E signal quality.
Strong Python and/or Go: production services, not just scripts; automation that touches real infrastructure safely.
Track record of reducing MTTR/MTTD through automation, workflow orchestration, and AIOps tooling.
Executive communication: translating infrastructure health and operational risk into clear narratives for senior leadership.
Demonstrated track record of moving teams from reactive, process-heavy operations to automated, technology-focused models, not just managing existing runbooks.

Strongly Preferred
Experience operating customer-facing infrastructure or platform services, with reliability expectations beyond internal tooling.
Knowledge of high-speed interconnect fabrics: InfiniBand, RoCE, or NVLink, including setup, troubleshooting, and performance tuning.
HPC job scheduler experience: Slurm, LSF, or equivalent, including setup, tuning, and integration with infrastructure automation.
Multi-cloud hybrid operations: AWS, Azure, Google Cloud Platform alongside on-prem/colo, with unified observability and IaC across all tiers.
FinOps: cloud spend attribution, TCO modeling across cloud vs. on-prem vs. colo, and translating cost data into workload placement recommendations for engineering and executive audiences.
ITIL knowledge or equivalent structured incident/problem/change management framework experience.
Published technical writing, conference talks, or open-source contributions in reliability, observability, or HPC infrastructure.

Employers have access to artificial intelligence language tools (“AI”) that help generate and enhance job descriptions and AI may have been used to create this description. The position description has been reviewed for accuracy and Dice believes it to correctly reflect the job opportunity.

Dice Id: cxbcsi
Position Id: Job44644
