Overview
On Site
210k - 270k
Full Time
Skills
SAFE
Reliability Engineering
Scalability
High Availability
Operational Excellence
Leadership
Microservices
Cloud Computing
Linux
Git
Continuous Integration
Continuous Delivery
Production Support
Scripting
Python
Terraform
Conflict Resolution
Problem Solving
Amazon Web Services
Kubernetes
Incident Management
JIRA
Service Level
Continuous Improvement
Mentorship
Dashboard
Collaboration
Management
Health Insurance
Professional Development
Job Details
Job Description
A fast-growing tech company that builds a data platform to help organizations make safe, fair, and compliant decisions is seeking an experienced Site Reliability Engineering Manager to lead the team responsible for the reliability, performance, and scalability of its cloud-based services. The role involves managing incident response, improving system observability, and working closely with product and infrastructure teams to maintain high availability and operational excellence.
Required Skills & Experience
- 8+ years in relevant technical roles, with 4+ years in leadership or management.
- Strong background in designing and managing observability tools like Datadog or Prometheus.
- Experience with containerized microservices on a public cloud platform.
- Proficient with Linux, Git, and CI/CD pipelines.
- Skilled in on-call production support and incident management.
- Ability to automate tasks and improve reliability using scripting (Python preferred).
- Experience with Infrastructure as Code tools (Terraform, CloudFormation, etc.).
- Strong problem-solving skills and commitment to security best practices.
- Familiarity with AWS, Kubernetes, and event-driven architectures.
- Experience mentoring engineers and leading technical teams.
- Knowledge of incident management and collaboration tools (PagerDuty, Jira).
- Ability to define and track service-level objectives and metrics.
- Active participation in continuous improvement efforts.
Daily Responsibilities:
- Lead and mentor the SRE team, helping resolve blockers and grow skills.
- Manage daily incident escalations and coordinate with on-call engineers.
- Collaborate with other managers to define reliability metrics and dashboards.
- Communicate incident updates to stakeholders and support cross-team collaboration.
- Participate in design and infrastructure reviews to embed reliability early.
- Oversee on-call rotations and ensure thorough incident reviews.
- Drive automation projects to remove operational bottlenecks and improve system uptime.
- Compensation: 210K-270K
- Work arrangement: Hybrid
You will receive the following benefits:
- Medical insurance coverage
- Dental benefits
- Vision benefits
- 401(k) retirement plan with company match
- Ongoing professional development opportunities
- Equity ownership options
- Additional perks and benefits