Site Reliability Engineer Manager - Hybrid

Hybrid in Santa Clara, CA, US • Posted 4 days ago • Updated 6 hours ago
Contract W2
On-site
$90-120/hr

Job Details

Skills

  • Reliability Engineering
  • Technical Direction
  • DevOps
  • Team Leadership
  • Operational Excellence
  • Continuous Improvement
  • Software Development
  • Change Control
  • Problem Management
  • Artificial Intelligence
  • Analytics
  • Root Cause Analysis
  • Capacity Management
  • Provisioning
  • Collaboration
  • Documentation
  • Solaris
  • POC
  • Roadmaps
  • Computer Science
  • Electrical Engineering
  • Production Engineering
  • Recruiting
  • Linux
  • TCP/IP
  • Remote Direct Memory Access
  • Computer Hardware
  • Computer Networking
  • Fluency
  • Terraform
  • Ansible
  • Kubernetes
  • Lifecycle Management
  • Storage
  • RBAC
  • Grafana
  • Python
  • Scripting
  • Workflow
  • Orchestration
  • Communication
  • Operational Risk
  • Leadership
  • Management
  • Customer Facing
  • InfiniBand
  • Performance Tuning
  • LSF
  • Amazon Web Services
  • Microsoft Azure
  • Google Cloud
  • Google Cloud Platform
  • Modeling
  • Cloud Computing
  • ITIL
  • Change Management
  • Technical Writing
  • Open Source
  • HPC

Summary

The Role
You will build and lead the Site Reliability Engineering team, owning the infrastructure that development, validation, and customer-facing deployments run on. This spans colocation facilities, on-premises lab clusters, cloud environments (AWS, Azure, Google Cloud Platform), and the platform services customers use to collaborate on hardware and software deployments.

You are both a people manager and a practicing engineer. You will set technical direction, hire and grow the team, own SLOs for critical systems, and be the senior escalation point when things go wrong. You will work closely with hardware and software development teams to ensure HPC infrastructure meets their workload requirements and partner with the Senior DevOps Lead whose pipelines and automation run on the infrastructure you own.

What You Will Do
Team Leadership & Strategy
Develop and manage a team of 3-5 SRE engineers; establish a culture of operational excellence, ownership, and continuous improvement.
Define the SRE team's technical roadmap: reliability architecture, automation priorities, capacity planning, and on-call model.
Serve as the senior technical escalation point for critical incidents: guiding cross-team triage, driving RCA, and ensuring systemic fixes rather than point patches.
Translate operational signals and infrastructure health into clear, actionable narratives for engineering leadership and executive stakeholders.
Partner with hardware and software development teams to understand HPC workload requirements and ensure infrastructure capacity, performance, and reliability meet the needs of silicon and software development programs.

24x7 Infrastructure Reliability & Observability
Own 24/7 reliability across colocation, on-premises lab clusters, cloud, and customer-facing platform services, designing for failure domains, progressive delivery, and strict change control at every tier.
Own the full observability stack (metrics, traces, logs) and define SLOs/SLIs across all SRE systems; use AI-driven detection, correlation, and guided remediation to reduce time to detect, respond, and resolve.
Evolve incident and problem management into a data-driven discipline: automated triage workflows, AI/analytics to identify recurring patterns, and every P0/P1 producing a written RCA with tracked systemic fixes.
Lead FinOps and capacity planning: model TCO across cloud vs. on-prem vs. colo, drive workload placement decisions, and anticipate infrastructure needs for new silicon programs and customer deployments.
Own infrastructure for customer collaboration environments where partners deploy and validate hardware and software.

Automation & Infrastructure as Code
Drive IaC-first discipline across the team: Terraform, Ansible, and production-quality automation for all infrastructure provisioning and lifecycle management.
Build and mature self-healing infrastructure platforms: host lifecycle automation, fleet auto-remediation, and AIOps-driven alerting that reduce manual intervention across the operational lifecycle.

Documentation & Global Collaboration
Build a documentation culture and scale a follow-the-sun on-call model as we expand globally, with runbooks, architecture diagrams, and operational playbooks maintained as living artifacts.
Drive POC and POV evaluations for new infrastructure technologies, interconnect fabrics, and platform services relevant to our accelerator roadmap.

What You Will Bring
Required
Bachelor's or Master's in Computer Science, Electrical Engineering, or related field; 12+ years in SRE, infrastructure engineering, or production engineering (8 years minimum).
3+ years managing SRE or infrastructure teams: hiring, growing, and retaining engineers in a fast-moving environment.
Deep Linux systems expertise: networking (TCP/IP, RDMA, bonding), storage, kernel tuning, and bare-metal operations.
Proven experience operating colocation and on-premises hardware at scale: server lifecycle, power and cooling awareness, rack-level networking.
IaC fluency: Terraform and Ansible at production scale, including module design, remote state, environment isolation, and change governance.
Kubernetes cluster operations: lifecycle management, workload reliability, storage, and RBAC at scale.
Full observability stack ownership: Prometheus, Grafana, and/or Datadog, including SLO definition, alert design, and E2E signal quality.
Strong Python and/or Go: production services, not just scripts; automation that touches real infrastructure safely.
Track record of reducing MTTR/MTTD through automation, workflow orchestration, and AIOps tooling.
Executive communication: translating infrastructure health and operational risk into clear narratives for senior leadership.
Demonstrated track record of moving teams from reactive, process-heavy operations to automated, technology-focused models, not just managing existing runbooks.

Strongly Preferred
Experience operating customer-facing infrastructure or platform services, with reliability expectations beyond internal tooling.
Knowledge of high-speed interconnect fabrics: InfiniBand, RoCE, or NVLink, including setup, troubleshooting, and performance tuning.
HPC job scheduler experience: Slurm, LSF, or equivalent, including setup, tuning, and integration with infrastructure automation.
Multi-cloud hybrid operations: AWS, Azure, Google Cloud Platform alongside on-prem/colo, with unified observability and IaC across all tiers.
FinOps: cloud spend attribution, TCO modeling across cloud vs. on-prem vs. colo, and translating cost data into workload placement recommendations for engineering and executive audiences.
ITIL knowledge or equivalent structured incident/problem/change management framework experience.
Published technical writing, conference talks, or open-source contributions in reliability, observability, or HPC infrastructure.
  • Dice Id: cxbcsi
  • Position Id: Job44644