Site Reliability Engineer

Overview

On Site

Full Time

Skills

Communication

Documentation

Workflow

Root Cause Analysis

Stacks Blockchain

Investments

Optimization

Testing

Information Technology

Computer Science

Orchestration

Scripting

Grafana

Cloud Computing

Amazon Web Services

Google Cloud

Google Cloud Platform

PostgreSQL

Performance Tuning

Query Optimization

Incident Management

Operational Excellence

Continuous Delivery

Continuous Integration

Kubernetes

Docker

Terraform

Akka

.NET

Version Control

GitHub

GitLab

Microsoft Azure

DevOps

Python

Bash

Windows PowerShell

Job Details

Job Description

As a Site Reliability Engineer, you will be responsible for:

Operational Excellence & Incident Management

- Maintain and monitor production systems for availability, latency, and performance.

- Lead incident response efforts, including communication, resolution, and postmortem documentation.

- Design and implement health checks, alerting systems, and automated remediation workflows.

- Drive root cause analysis and implement permanent resolutions for recurring issues.

Observability & Insights

- Set up and maintain full observability stacks (logging, metrics, tracing) using tools like Prometheus, Grafana, Datadog, OpenTelemetry, or ELK.

- Analyze telemetry and logs to identify trends, anomalies, and opportunities for improvement.

- Conduct post-incident reviews and use insights to inform future engineering investments.

Performance & Systems Optimization

- Tune and optimize distributed systems, including Akka.NET actors, for performance and resource efficiency.

- Work with developers to evolve architecture and improve system throughput, latency, and stability.

- Optimize PostgreSQL performance, queries, and maintenance strategies.

CI/CD & Automation

- Design and maintain modern CI/CD pipelines using GitHub Actions, Azure Pipelines, or GitLab CI.

- Automate deployment, testing, and rollback processes to reduce friction and increase deployment frequency.

- Standardize infrastructure-as-code practices across environments.

Education and Experience

- 5+ years of experience in SRE, DevOps, or Infrastructure Engineering roles.

- Bachelor's degree in Information Technology, Computer Science, or a related field.

- Expertise in Kubernetes and container orchestration at scale.

- Strong experience with Akka.NET or similar actor-based frameworks.

- Proficiency with scripting and automation (Bash, PowerShell, Python).

- Experience with observability tools (Phobos, Datadog, Prometheus, Grafana, OpenTelemetry, ELK).

- Hands-on experience with cloud platforms (AWS, Azure, or Google Cloud Platform).

- Strong PostgreSQL knowledge: performance tuning, query optimization, and maintenance.

- Proven ability to lead incident management and drive postmortem processes.

- A builder's mindset with high standards for operational excellence and technical ownership.

Preferred Tools & Ecosystem Experience

- CI/CD: GitHub Actions, Azure Pipelines, GitLab CI

- Infrastructure: Kubernetes, Docker, Terraform

- Monitoring: Phobos (Akka.NET), Datadog, Prometheus

- Source Control: GitHub, GitLab, Azure DevOps

- Programming: C#, Python, Bash, PowerShell

Employers have access to artificial intelligence language tools (“AI”) that help generate and enhance job descriptions and AI may have been used to create this description. The position description has been reviewed for accuracy and Dice believes it to correctly reflect the job opportunity.
