Join a fast-moving gaming technology company as a Site Reliability Engineer, ensuring a real-money gaming platform operates with exceptional reliability, performance, and scalability for lotteries and partners worldwide. This full-time role sits at the intersection of software engineering and infrastructure, focused on building resilient systems, automating operations, and maintaining production health across a distributed architecture. You'll partner closely with backend engineers to design fault-tolerant, observable, and scalable systems from day one - owning platform stability, production performance, deployment reliability, and incident response end to end.
This is a high-ownership SRE role where you're not just maintaining infrastructure - you're shaping it. You'll define and maintain Service Level Indicators and Objectives, align error budgets with contractual SLAs, and lead incident response, root cause analysis, and postmortems on a platform where reliability directly impacts real-money gaming experiences. The observability stack is modern and comprehensive, leveraging Grafana, Prometheus, Tempo, and Loki to give you full visibility into system health across all environments. The CI/CD and deployment automation scope is substantial, and you'll have real influence over cloud infrastructure optimization and cost efficiency. What makes this role particularly compelling is the mission-critical nature of the platform - when infrastructure just works, engineers ship faster, deployments are safe and repeatable, and systems scale automatically under load. For an SRE who takes pride in building systems that rarely break and recover quickly when they do, this role is built for you.
Required Skills & Experience
- 5+ years of experience in SRE, DevOps, or infrastructure engineering
- Strong experience with Kubernetes, Docker, and cloud platforms (Google Cloud Platform preferred)
- Deep knowledge of distributed systems and networking
- Experience building CI/CD pipelines and deployment automation
- Proficiency with observability tools including Grafana, Prometheus, Tempo, and Loki
- Experience managing production incidents and reliability processes including postmortems
- Strong troubleshooting and systems thinking skills
- Strong knowledge of microservices architecture
- Familiarity with Go
- Familiarity with service meshes such as Istio
- Familiarity with managing PostgreSQL at scale
Desired Skills & Experience
- Experience defining and maintaining SLIs, SLOs, and error budgets aligned to contractual SLAs
- Background in optimizing cloud infrastructure usage and cost efficiency
- Experience managing secrets, environment configuration, and deployment safety in regulated or high-availability environments
- Prior experience in gaming, fintech, or other mission-critical real-money platforms
What You Will Be Doing
Tech Breakdown
- 35% Platform Reliability and Infrastructure - uptime ownership, architecture design, and production health
- 25% CI/CD and Deployment Automation - pipeline management, release automation, and deployment safety
- 25% Observability and Incident Response - monitoring, logging, alerting, root cause analysis, and postmortems
- 15% Scalability and Cost Optimization - performance improvements, automation, and cloud efficiency
Daily Responsibilities
- 50% Infrastructure and Platform Ownership - reliability, deployment, configuration, and production readiness
- 30% Observability and Incident Management - monitoring systems, incident response, and SLO management
- 20% Engineering Partnership and Automation - collaborating with backend teams, reducing manual intervention, and optimizing operations