Site Reliability Engineer

Overview

On Site
USD 165,000.00 - 185,000.00 per year
Full Time

Skills

Reliability Engineering
Software Development
Leadership
Service Level
Scalability
Regression Testing
Performance Tuning
Performance Metrics
Issue Resolution
Collaboration
Machine Learning (ML)
Jira
Management
Continuous Improvement
Performance Engineering
Computer Science
Cloud Computing
Amazon Web Services
Performance Monitoring
Version Control
Git
GitHub
Terraform
Java
Spring Framework
Microservices
RDBMS

Job Details

Location: Bloomington, MN
Salary: $165,000.00 - $185,000.00 USD annually
Description:
Position Title: Senior Site Reliability Engineer

Work Location: Hybrid in Bloomington, MN

Role Type: Direct Hire

About the Role:

Our client is looking for a Senior Site Reliability Engineer to establish and drive best practices in system reliability, performance optimization, and observability. You will bring more than five years of experience and deep expertise in software development and infrastructure operations, particularly in building and maintaining scalable, data-intensive systems. Your primary focus will be defining and implementing Service Level Objectives (SLOs) and Service Level Indicators (SLIs) so that our solutions meet rigorous performance standards. You will collaborate closely with cross-functional teams to build observability frameworks that let teams proactively monitor, diagnose, and improve system performance. Your leadership and persistence will be crucial in identifying and resolving performance bottlenecks and in ensuring long-term scalability and efficiency across our systems.

Key Responsibilities:
  • Collaborate with development and operations teams to design, implement, and maintain observability frameworks that provide deep insights into system performance, particularly for data and ML pipelines.
  • Lead the establishment of Service Level Objectives (SLOs) and Service Level Indicators (SLIs), ensuring they align with business goals and drive continuous performance improvements.
  • Partner with stakeholders to understand system performance requirements and translate them into actionable performance engineering strategies.
  • Proactively identify performance bottlenecks and collaborate with teams to implement solutions that enhance system scalability and reliability.
  • Design and execute performance regression test suites, focusing on data-intensive and ML workloads, to ensure continuous performance optimization.
  • Own the reliability and performance metrics of our systems, driving a culture of performance excellence and proactive issue resolution.
  • Collaborate with subject matter experts to gain a deep understanding of domain-specific performance challenges, particularly in data and ML pipelines.
  • Use tools such as Datadog, Jira, and GitHub to monitor system performance, manage projects, and track issues, with a strong emphasis on performance-related metrics.
  • Define and monitor success metrics, ensuring our systems consistently meet or exceed performance and reliability targets.
  • Actively contribute to the continuous improvement of performance engineering practices across the team, fostering a culture of excellence in observability and system performance.
  • Perform other duties as assigned.


Qualifications:
  • Bachelor's degree in Computer Science, Engineering, or a related field.
  • Five years of experience in a site-reliability-focused role responsible for establishing reliability standards in a cloud-native environment.
  • Strong expertise in establishing SLOs/SLIs and building observability frameworks for complex systems.
  • Proficiency with cloud services, particularly AWS, and experience in designing scalable and reliable architectures.
  • Hands-on experience with performance monitoring and observability tools like Datadog.
  • Proficiency in version control systems such as Git/GitHub and infrastructure-as-code tools such as Terraform.


Preferred Qualifications:
  • Proficiency in Java programming and hands-on experience with REST, Spring, and microservices development.
  • Proficiency in RDBMS schema design and index utilization.

Contact:

This job and many more are available through The Judge Group. Please apply with us today!

About Judge Group, Inc.