Site Reliability Engineer / Remote

Remote, KY, US • Posted 22 hours ago • Updated 10 hours ago
Full Time
On-site
$160000 - $180000/yr

Job Details

Skills

  • Software Engineering
  • Service Level
  • SAFE
  • DevOps
  • Kubernetes
  • Docker
  • Google Cloud
  • Google Cloud Platform
  • Computer Networking
  • Grafana
  • Microservices
  • PostgreSQL
  • Budget
  • IaaS
  • High Availability
  • Continuous Integration
  • Continuous Delivery
  • Pipeline Management
  • Root Cause Analysis
  • Scalability
  • Optimization
  • Cloud Computing
  • Incident Management
  • Management
  • Partnership
  • Collaboration

Summary

Join a fast-moving gaming technology company as a Site Reliability Engineer, ensuring a real-money gaming platform operates with exceptional reliability, performance, and scalability for lotteries and partners worldwide. This full-time role sits at the intersection of software engineering and infrastructure, focused on building resilient systems, automating operations, and maintaining production health across a distributed architecture. You'll partner closely with backend engineers to design fault-tolerant, observable, and scalable systems from day one - owning platform stability, production performance, deployment reliability, and incident response end to end.

This is a high-ownership SRE role where you're not just maintaining infrastructure - you're shaping it. You'll define and maintain Service Level Indicators and Objectives, align error budgets with contractual SLAs, and lead incident response, root cause analysis, and postmortems on a platform where reliability directly impacts real-money gaming experiences. The observability stack is modern and comprehensive, leveraging Grafana, Prometheus, Tempo, and Loki to give you full visibility into system health across all environments. The CI/CD and deployment automation scope is substantial, and you'll have real influence over cloud infrastructure optimization and cost efficiency. What makes this role particularly compelling is the mission-critical nature of the platform - when infrastructure just works, engineers ship faster, deployments are safe and repeatable, and systems scale automatically under load. For an SRE who takes pride in building systems that rarely break and recover quickly when they do, this role is built for you.
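As a rough illustration of the error-budget arithmetic mentioned above, the sketch below converts an availability SLO into allowed downtime and reports how much of that budget is left. The function names, the 30-day window, and the 99.9% target are assumptions made for the example; the posting does not state the platform's actual SLOs or SLAs.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over the window.

    slo_target is a fraction such as 0.999 (hypothetical example value).
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def budget_remaining(slo_target: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo_target, window_days)
    return 1.0 - downtime_minutes / budget
```

Under these assumptions, a 99.9% target over 30 days allows about 43.2 minutes of downtime, so 21.6 minutes of recorded downtime would leave roughly half the budget. Setting the internal SLO stricter than the contractual SLA keeps a margin before any contractual breach.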

Required Skills & Experience
  • 5+ years of experience in SRE, DevOps, or infrastructure engineering
  • Strong experience with Kubernetes, Docker, and cloud platforms (Google Cloud Platform preferred)
  • Deep knowledge of distributed systems and networking
  • Experience building CI/CD pipelines and deployment automation
  • Proficiency with observability tools including Grafana, Prometheus, Tempo, and Loki
  • Experience managing production incidents and reliability processes including postmortems
  • Strong troubleshooting and systems thinking skills
  • Strong knowledge of microservices architecture
  • Familiarity with Go
  • Familiarity with service meshes such as Istio
  • Familiarity with managing PostgreSQL at scale

Desired Skills & Experience
  • Experience defining and maintaining SLIs, SLOs, and error budgets aligned to contractual SLAs
  • Background optimizing cloud infrastructure usage and cost efficiency
  • Experience managing secrets, environment configuration, and deployment safety in regulated or high-availability environments
  • Prior experience in gaming, fintech, or other mission-critical real-money platforms

What You Will Be Doing

Tech Breakdown
  • 35% Platform Reliability and Infrastructure - uptime ownership, architecture design, and production health
  • 25% CI/CD and Deployment Automation - pipeline management, release automation, and deployment safety
  • 25% Observability and Incident Response - monitoring, logging, alerting, root cause analysis, and postmortems
  • 15% Scalability and Cost Optimization - performance improvements, automation, and cloud efficiency

Daily Responsibilities
  • 50% Infrastructure and Platform Ownership - reliability, deployment, configuration, and production readiness
  • 30% Observability and Incident Management - monitoring systems, incident response, and SLO management
  • 20% Engineering Partnership and Automation - collaborating with backend teams, reducing manual intervention, and optimizing operations

Employers have access to artificial intelligence language tools (“AI”) that help generate and enhance job descriptions and AI may have been used to create this description. The position description has been reviewed for accuracy and Dice believes it to correctly reflect the job opportunity.

  • Dice Id: 10105282
  • Position Id: 870492