Senior Site Reliability Engineer

Chicago, IL, US • Posted 3 days ago • Updated 4 hours ago

Full Time

On-site

USD $129,000.00 - $160,000.00 per year

Job Details

Skills

Retail
Partnership
Brand
Health Care
Aspen
Scalability
Exceed
Artificial Intelligence
Predictive Analytics
Service Level
Management
Dashboard
Reporting
Performance Tuning
Log Analysis
ROOT
Decision Support
Pattern Recognition
Trend Analysis
Performance Metrics
Capacity Management
Forecasting
Process Improvement
Optimization
Documentation
Communication
Collaboration
Knowledge Sharing
Computer Science
Reliability Engineering
Python
C#
Machine Learning (ML)
Terraform
Cloud Computing
Kubernetes
Microsoft Azure
Amazon Web Services
Computer Networking
IaaS
Analytics
Grafana
Google Cloud
Google Cloud Platform
Problem Solving
Conflict Resolution
Incident Management
Root Cause Analysis
Workflow

Summary

The Aspen Group (TAG) is one of the largest and most trusted retail healthcare business support organizations in the U.S. and has supported over 20,000 healthcare professionals and team members with close to 1,500 health and wellness offices across 48 states in four distinct categories: dental care, urgent care, medical aesthetics, and animal health. Working in partnership with independent practice owners and clinicians, the team is united by a single purpose: to prove that healthcare can be better and smarter for everyone. TAG provides a comprehensive suite of centralized business support services that power the impact of five consumer-facing businesses: Aspen Dental, ClearChoice Dental Implant Centers, WellNow Urgent Care, Chapter Aesthetic Studio, and Lovet Pet Health Care. Each brand has access to a deep community of experts, tools and resources to grow their practices, and an unwavering commitment to delivering high-quality consumer healthcare experiences at scale.

As a Senior Site Reliability Engineer (SRE) at TAG - The Aspen Group, you will be responsible for ensuring the reliability, performance, and scalability of our core systems. This role involves proactively building and managing monitoring solutions, leading incident response, and continuously optimizing system performance to exceed business objectives. We are actively integrating AI and machine learning into our operational workflows, and you will be on the front lines, leveraging intelligent automation and machine learning to build a proactive, resilient infrastructure. This is an opportunity to go beyond SRE by applying cutting-edge technology to solve complex reliability challenges.

Responsibilities:

Intelligent Site Reliability Engineering:

Design and build highly scalable and resilient systems to support our applications and services, incorporating predictive analytics to anticipate reliability risks.
Develop and manage Service Level Objectives (SLOs) and Service Level Indicators (SLIs) using machine learning anomaly detection to ensure systems meet reliability targets.
Drive improvements in system reliability, availability, and performance through proactive measures, automation, and intelligent failure prediction.

Advanced Observability:

Implement and manage comprehensive monitoring and alerting solutions, integrating with intelligent observability platforms that reduce alert noise and correlate events.
Develop and maintain dashboards and reporting tools that provide data-driven insights for actionable troubleshooting recommendations and performance optimization.
Evaluate and integrate advanced monitoring tools and operational intelligence platforms to enhance observability and root cause identification.

Proactive Incident Management:

Lead and participate in incident response efforts, using intelligent log analysis and automated event correlation to speed up troubleshooting and root cause identification.
Develop and maintain incident management processes incorporating automated decision support systems to improve response times and minimize service disruptions.
Conduct post-incident reviews, using automated pattern recognition and trend analysis to identify systemic issues and implement preventive measures.

Performance and Capacity Optimization:

Analyze performance metrics and logs, supported by advanced observability tools, to detect bottlenecks and inefficiencies.
Collaborate with development teams to implement automated profiling and optimization recommendations for code and infrastructure improvements.
Perform capacity planning using machine learning forecasting models to ensure systems can handle current and future loads.

Automation and Process Improvement:

Develop and implement automation solutions, including intelligent runbook automation, self-healing systems, and automated incident triage.
Identify and drive process improvements by applying machine learning to operational data for continuous optimization.
Maintain documentation that includes automation and machine learning guidelines for monitoring, incident management, and SRE best practices.

Collaboration and Communication:

Work closely with engineering, operations, and product teams to align reliability and monitoring goals, including automation adoption strategies.
Communicate effectively with stakeholders, providing regular updates on system health, incidents, performance improvements, and data-driven insights.
Foster a culture of collaboration, knowledge sharing, and automation best practices within the team and across the organization.

Requirements:

Bachelor's degree in computer science or a related technical field.
At least 5 years of experience in Site Reliability Engineering or a similar role.
Strong proficiency in at least one programming language such as Python, Go, or C#.
Demonstrated experience applying machine learning and automation to operational workflows such as monitoring, alerting and incident response.
Expertise with infrastructure as code tools such as Terraform.
Proven experience working with and monitoring container environments such as Cloud Run and Kubernetes.
Hands-on experience using and working within Azure, AWS, and Google Cloud Platform environments (Google Cloud Platform preferred).
Strong understanding of networking, distributed systems, and cloud infrastructure.
Familiarity with intelligent monitoring platforms and operational analytics tools such as Prometheus, Grafana, OpenSearch, Sentry, and Google Cloud Observability.
Excellent problem-solving skills and the ability to work independently and as part of a team.
Experience with incident management, root cause analysis, and automated operational workflows.

Annual pay range: $129,000-$160,000

A generous benefits package that includes paid time off, health, dental, vision, and 401(k) savings plan with match

Dice Id: 10280496
Position Id: a4605131db2bb7c241e20e02b7ad50cc
