Site Reliability Manager

Overview

On Site

210k - 270k

Full Time

Skills

SAFE

Reliability Engineering

Scalability

High Availability

Operational Excellence

Leadership

Microservices

Cloud Computing

Linux

Git

Continuous Integration

Continuous Delivery

Production Support

Scripting

Python

Terraform

Problem Solving

Conflict Resolution

Amazon Web Services

Kubernetes

Incident Management

JIRA

Service Level

Continuous Improvement

Mentorship

Dashboard

Collaboration

Management

Health Insurance

Professional Development

Job Details

Job Description
A fast-growing tech company that builds a data platform to help organizations make safe, fair, and compliant decisions is seeking an experienced Site Reliability Engineering Manager to lead the team responsible for the reliability, performance, and scalability of its cloud-based services. The role involves managing incident response, improving system observability, and working closely with product and infrastructure teams to maintain high availability and operational excellence.

Required Skills & Experience

8+ years in relevant technical roles, with 4+ years in leadership or management.
Strong background in designing and managing observability tools like Datadog or Prometheus.
Experience with containerized microservices on a public cloud platform.
Proficient with Linux, Git, and CI/CD pipelines.
Skilled in on-call production support and incident management.
Ability to automate tasks and improve reliability using scripting (Python preferred).
Experience with Infrastructure as Code tools (Terraform, CloudFormation, etc.).
Strong problem-solving skills and commitment to security best practices.

Desired Skills & Experience

Familiarity with AWS, Kubernetes, and event-driven architectures.
Experience mentoring engineers and leading technical teams.
Knowledge of incident management and collaboration tools (PagerDuty, Jira).
Ability to define and track service-level objectives and metrics.
Participation in continuous improvement initiatives.

What You Will Be Doing
Daily Responsibilities:

Lead and mentor the SRE team, helping resolve blockers and grow skills.
Manage daily incident escalations and coordinate with on-call engineers.
Collaborate with other managers to define reliability metrics and dashboards.
Communicate incident updates to stakeholders and support cross-team collaboration.
Participate in design and infrastructure reviews to embed reliability early.
Oversee on-call rotations and ensure thorough incident reviews.
Drive automation projects to remove operational bottlenecks and improve system uptime.

The Offer

210K-270K
Hybrid

You will receive the following benefits:

Medical insurance coverage
Dental benefits
Vision benefits
401(k) retirement plan with company match
Ongoing professional development opportunities
Equity ownership options
Additional perks and benefits

Employers have access to artificial intelligence language tools (“AI”) that help generate and enhance job descriptions and AI may have been used to create this description. The position description has been reviewed for accuracy and Dice believes it to correctly reflect the job opportunity.

About Motion Recruitment Partners, LLC
