Job Details
Site Reliability Engineer (SRE)
Location: Remote (U.S. based)
Employment Type: 6-month contract with possible extension
Industry: Automotive / Internal Tools
Work Requirements: , Holders or Authorized to Work in the U.S.
Rate: $55.00-$64.10/hr
Team: Engineering (working alongside DevOps, Product, and Back-End/Front-End developers)
About Us
We're developing a mission-critical internal tool for a major automotive service provider, currently live in 20+ stores with plans to scale to over 2,000 nationwide. With a growing user base and a need for real-time operational insights, we're expanding our engineering team to include a dedicated Site Reliability Engineer (SRE) to help us ensure performance, reliability, and observability at scale.
About the Role
We're looking for a Site Reliability Engineer with a strong foundation in observability, particularly with DataDog, to partner with our DevOps engineer and broader development team. You'll be instrumental in helping us understand how users interact with our application, how our systems respond in real time, and how we can scale with confidence.
This is a hands-on role that balances infrastructure insight, system reliability, and the strategic implementation of monitoring, alerting, and recovery practices.
Key Responsibilities
- Observability & Monitoring
  - Build, refine, and maintain monitoring, alerting, and dashboards in DataDog to surface application and infrastructure performance metrics.
  - Work closely with product and engineering to define and track SLIs/SLOs.
  - Identify and instrument key user interaction points to improve system visibility.
- Infrastructure & Reliability
  - Contribute to system reliability efforts for our Azure-based back-end and Vercel-hosted front-end.
  - Support disaster recovery planning and implementation.
  - Help define best practices for error budgets, incident response, and availability targets.
- Scalability & Performance
  - Assist in load testing initiatives to prepare the application for broader deployment across thousands of locations.
  - Collaborate with DevOps to enhance deployment pipelines and infrastructure scalability.
Required Qualifications
- 3+ years of experience in a Site Reliability Engineering, DevOps, or related role.
- Hands-on experience with DataDog, including custom dashboards, alerts, and APM features.
- Strong grasp of observability principles (SLIs/SLOs, alert fatigue, tracing).
- Working knowledge of Microsoft Azure services and environments.
- Familiarity with incident response, root cause analysis, and postmortems.
Nice-to-Have Experience
- Load testing tools (e.g., k6, Gatling, Locust).
- Infrastructure as Code (Terraform, Bicep, ARM templates).
- CI/CD pipeline development and optimization.
- Vercel deployment and configuration practices.
What We Offer
- Opportunity to shape observability and reliability for a high-impact internal tool.
- Collaborative, low-ego team environment.
- Remote-friendly work culture.
- Exposure to modern tech stacks and scalable architecture challenges.