Site Reliability Engineer

Overview

Remote

On Site

Hybrid

BASED ON EXPERIENCE

Contract - W2

Contract - Independent

Contract - 3+ mo(s)

Skills

Real-time

Recovery

Performance Metrics

Disaster Recovery

Budget

Collaboration

Scalability

Reliability Engineering

DevOps

Dashboard

Software Performance Management

Microsoft Azure

Incident Management

Root Cause Analysis

Load Testing

Terraform

ARM

Continuous Integration

Continuous Delivery

Optimization

Stacks Blockchain

Job Details

Site Reliability Engineer (SRE)

Location: Remote (U.S. based)
Employment Type: 6-month contract with possible extension
Industry: Automotive / Internal Tools
Work Requirements: , Holders or Authorized to Work in the U.S.
Rate: $55-$64.10/hr
Team: Engineering (working alongside DevOps, Product, and Back-End/Front-End developers)

About Us

We're developing a mission-critical internal tool for a major automotive service provider, currently live in 20+ stores with plans to scale to over 2,000 nationwide. With a growing user base and a need for real-time operational insights, we're expanding our engineering team to include a dedicated Site Reliability Engineer (SRE) to help us ensure performance, reliability, and observability at scale.

About the Role

We're looking for a Site Reliability Engineer with a strong foundation in observability, particularly with Datadog, to partner with our DevOps engineer and broader development team. You'll be instrumental in helping us understand how users interact with our application, how our systems respond in real time, and how we can scale with confidence.

This is a hands-on role that balances infrastructure insight, system reliability, and the strategic implementation of monitoring, alerting, and recovery practices.

Key Responsibilities

Observability & Monitoring
- Build, refine, and maintain monitors, alerts, and dashboards in Datadog to surface application and infrastructure performance metrics.
- Work closely with product and engineering to define and track SLIs/SLOs.
- Identify and instrument key user interaction points to improve system visibility.

Infrastructure & Reliability
- Contribute to system reliability efforts for our Azure-based back-end and Vercel-hosted front-end.
- Support disaster recovery planning and implementation.
- Help define best practices for error budgets, incident response, and availability targets.

Scalability & Performance
- Assist with load-testing initiatives to prepare the application for broader deployment across thousands of locations.
- Collaborate with DevOps to enhance deployment pipelines and infrastructure scalability.

Required Qualifications

- 3+ years of experience in a Site Reliability Engineering, DevOps, or related role.
- Hands-on experience with Datadog, including custom dashboards, alerts, and APM features.
- Strong grasp of observability principles (SLIs/SLOs, alert fatigue, tracing).
- Working knowledge of Microsoft Azure services and environments.
- Familiarity with incident response, root cause analysis, and postmortems.

Nice-to-Have Experience

- Load testing tools (e.g., k6, Gatling, Locust).
- Infrastructure as Code (Terraform, Bicep, ARM templates).
- CI/CD pipeline development and optimization.
- Vercel deployment and configuration practices.

What We Offer

- Opportunity to shape observability and reliability for a high-impact internal tool.
- Collaborative, low-ego team environment.
- Remote-friendly work culture.
- Exposure to modern tech stacks and scalable architecture challenges.


About INSPYR Solutions