Site Reliability Engineer, Consultant

Overview

On Site
Full Time

Skills

Software Engineering
System Administration
Cross-functional Team
IaaS
High Availability
Scalability
Recovery
Terraform
Root Cause Analysis
Performance Tuning
Network
Database
Capacity Management
Forecasting
Load Testing
Regulatory Compliance
Software Development Methodology
Software Architecture
Engineering Design
SAFE
Computer Science
Microsoft Azure
Amazon Web Services
Google Cloud Platform
Virtual Machines
Computer Networking
Identity Management
Storage
Scripting
Python
Java
Bash
Windows PowerShell
Orchestration
Kubernetes
Docker
Red Hat Linux
Management
Grafana
New Relic
Dynatrace
Splunk
SolarWinds
Dashboard
Jenkins
GitHub
GitLab
Continuous Integration
Continuous Delivery
Configuration Management
Ansible
Progress Chef
Puppet
Artificial Intelligence
Incident Management
Optimization
Testing
Chaos Engineering
Cloud Computing
Health Care
Innovation
Collaboration

Job Details

Job Description

Your Role

We are seeking an experienced Site Reliability Engineer (SRE) to lead reliability, scalability, and performance initiatives across our production systems. In this role, you will blend software engineering, automation, and systems operations to ensure that our platforms are resilient, efficient, and continuously improving.
You will be part of a cross-functional team responsible for designing, implementing, and maintaining reliable systems that support millions of requests daily. This position requires a deep understanding of distributed systems, cloud infrastructure, automation, and incident response.

Responsibilities

Your Work

In this role, you will own the following areas:
  • Reliability & Uptime: Design and maintain systems to achieve high availability (99.9%+), scalability, and resilience.
  • Monitoring & Observability: Build and improve monitoring stacks using tools like Prometheus, Grafana, Datadog, or New Relic.
  • Automation: Reduce manual toil by automating deployments, scaling, and recovery processes using IaC (Terraform or CloudFormation).
  • Incident Management: Lead and respond to production incidents, perform root-cause analysis, and drive postmortems and prevention strategies.
  • Performance Optimization: Identify system bottlenecks and improve performance across compute, network, and database layers.
  • Capacity Planning: Forecast growth, conduct load testing, and ensure services can handle future demand.
  • Security & Compliance: Implement best practices for infrastructure security, secrets management, and compliance requirements.
  • Collaboration: Partner with developers to embed reliability practices into the SDLC, CI/CD pipelines, and application architecture.
  • Chaos Engineering: Design and execute chaos testing experiments to proactively identify weaknesses in distributed systems and improve overall resilience.
  • Deployment Strategies: Implement and manage Blue/Green and Canary deployment methodologies to minimize risk and ensure safe, incremental rollouts of new features and updates.

Qualifications

Your Knowledge and Experience

Education & Experience
  • Requires a Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent practical experience); Master's degree a plus.
  • 7+ years of experience in building, supporting, and improving production systems and infrastructure.

Cloud Platforms
  • Minimum 5 years of hands-on experience with Azure, AWS, or Google Cloud Platform.
  • Demonstrated expertise in virtual machines (VMs), containers, cloud networking, identity and access management (IAM), monitoring, storage, and serverless functions.
  • Comfortable deploying and managing cloud-native services and infrastructure.

Programming & Scripting
  • Proficiency in one or more languages such as Python, Go, Java, Bash, PowerShell, or similar.
  • Ability to write clean, maintainable code for automation and tooling.

Containerization & Orchestration
  • Experience working with Kubernetes, Docker, and tools like Helm or Red Hat OpenShift.
  • Familiarity with managing containerized applications in production environments.

Monitoring & Observability
  • Working knowledge of tools such as Prometheus, Grafana, Datadog, New Relic, ELK Stack, Dynatrace, Splunk, BigPanda, and SolarWinds.
  • Ability to set up dashboards, alerts, and metrics to ensure system health and performance.

CI/CD & Configuration Management
  • Experience with CI/CD pipelines using tools like Jenkins, GitHub Actions, GitLab CI, Argo CD, or Spinnaker.
  • Familiarity with configuration management tools such as Ansible, Chef, or Puppet.

Automation & Emerging Technologies
  • Understanding of agentic AI systems and automation frameworks for incident response and infrastructure optimization is a plus.
  • Interest in exploring intelligent automation to improve reliability and reduce manual toil.

Testing & Deployment Expertise
  • Experience with chaos engineering tools (e.g., Gremlin, Chaos Monkey) and methodologies.
  • Hands-on knowledge of Blue/Green and Canary deployment strategies in cloud-native environments.

About the Team

About Stellarus and the Ascendiun Family of Companies

Stellarus, launched in January 2025, is designed to scale innovative healthcare solutions that support customers in creating a health care experience deserving of their family, friends, and neighbors.

Stellarus is part of a family of organizations overseen by a nonprofit corporate entity named Ascendiun. The Ascendiun Family of Companies also includes Blue Shield of California and its subsidiary, Blue Shield of California Promise Health Plan, as well as Altais, a clinical services company.

Stellarus' vision is to empower its customers to create a health care experience that is worthy of their family, friends, and neighbors. Stellarus' objective is to offer innovative, modern, scalable solutions that challenge the health care status quo. This closely aligns with Blue Shield of California's vision of using innovation to improve quality, affordability, and experience for members.

To achieve our mission, we foster an environment where all employees can thrive and contribute fully to address the needs of the various communities we serve. We are committed to creating and maintaining a supportive workplace that upholds our values and advances our goals.

Our Values:

At Stellarus, our core values of agility, trust, drive, courage and service shape our approach to developing innovative product offerings.

Our Workplace Model:

At Stellarus and the Ascendiun Family of Companies, we believe in fostering a workplace environment that balances purposeful in-person collaboration with flexibility. As we continue to evolve our workplace model, our focus remains on creating spaces where our people can connect with purpose - whether working in the office or through a hybrid approach - by providing clear expectations while respecting the diverse needs of our workforce.

Two Ways of Working:
  • Hybrid (Default): Work from a business unit-approved office at least two (2) times per month (for roles below Director-level) or once per week (for Director-level roles and above). Exceptions:
      o Member-facing and approved out-of-state roles remain remote.
      o Employees living more than 50 miles from their assigned offices are expected to work with their managers on a plan for periodic office visits.
      o For employees with medical conditions that may impact their ability to work in-office, we are committed to engaging in an interactive process and providing reasonable accommodations to ensure their work environment is conducive to their success and well-being.
  • On-Site: Work from a business unit-approved office an average of four (4) or more days a week.

Physical Requirements:

Office Environment: roles involving a part-time to full-time schedule in an office environment, based in our physical offices or a work-from-home office (desk work). Activity level: sedentary. Frequency: most of the workday.

Equal Employment Opportunity:

External hires must pass a background check/drug screen. Qualified applicants with arrest records and/or conviction records will be considered for employment in a manner consistent with Federal, State and local laws, including but not limited to the San Francisco Fair Chance Ordinance. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, national origin, sexual orientation, gender identity, protected veteran status, disability status, or any other classification protected by Federal, State and local laws.