Senior Site Reliability Engineer (SRE)

Overview

On Site

110,000 - 120,000

Full Time

No Travel Required

Unable to Provide Sponsorship

Skills

Incident Management

Reliability Engineering

Performance Engineering

Java

Kubernetes

Microsoft Azure

DevOps

Prolog

Haskell

OCaml

Linux

Scripting

Unix

Bash

Python

VMware

Virtualization

Apache Kafka

Splunk

Docker

Terraform

Azure Certification

Kubernetes Certification

Job Details

Role: Senior Site Reliability Engineer (SRE) – Java / Kubernetes / Azure
Location: Phoenix, AZ (Day 1 Onsite)
Job Type: Full-Time
Pay Rate Range: 105k - 125k/year

Key Responsibilities:

Provide senior-level SRE support, ensuring system reliability, availability, and operational excellence across all environments.
Develop and maintain services and automation scripts using Java as the primary programming language.
Build, deploy, and optimize workloads running on Kubernetes clusters (including multi-cluster and federated deployments).
Manage and enhance cloud infrastructure leveraging Azure services and best practices.
Work with Linux/Unix systems and develop automation using Bash shell scripting.
Build automation and tooling using Python or Go.
Design, implement, and maintain CI/CD pipelines using GitLab CI/CD and Jenkins.
Support application streaming, event processing, and analytics using Kafka Stream Generator, ksqlDB, and Spark Streaming.
Work with service mesh technologies, including Istio, and maintain a working understanding of Anthos Service Mesh.
Utilize VMware and other virtualization platforms for environment provisioning.
Provide robust incident support, root-cause analysis, and production issue resolution.
Implement eBPF-based observability and performance troubleshooting where applicable.
Develop and enhance monitoring and alerting systems using Splunk, Prometheus, Datadog, and Kiali.
Configure and manage NGINX Controller and Seesaw load balancing.
Use Terraform for infrastructure-as-code and Docker for containerization.
Manage Kubernetes storage using Portworx.
Automate repetitive operational tasks and contribute to platform stability and efficiency.
Provide support across all US time zones, including rotating shifts, weekend coverage, and occasional escalations at any hour.

Required Skills & Qualifications:

Extensive experience in incident response, troubleshooting, performance engineering, and service reliability.
Ability to automate manual operational tasks.
Strong understanding of monitoring, alerting, and observability practices.
Java (Proficient) – Must be hands-on in building, supporting, and optimizing Java-based systems and microservices.
Kubernetes (Hands-on) – Deployment, autoscaling, federation, ingress, storage, service mesh, and cluster operations.
Azure (Highly Proficient) – Strong experience across Azure compute, networking, storage, DevOps, and security features.
Proficiency in functional and logic programming languages: Haskell, OCaml, Prolog.
Knowledge of Linux/Unix internals and Bash scripting.
Strong experience with Python or Go.
VMware and virtualization technologies.
Kafka ecosystem tools: Kafka Stream Generator, ksqlDB, Spark Streaming.
Experience with Istio/Anthos Service Mesh.
Familiarity with eBPF for low-level observability.
Monitoring tools: Splunk, Prometheus, Datadog, Kiali.
Load balancing with NGINX Controller and Seesaw.
Docker and Terraform expertise.
Experience working with Portworx for Kubernetes storage.

Certifications Required:

Azure
Kubernetes

Employers have access to artificial intelligence language tools (“AI”) that help generate and enhance job descriptions and AI may have been used to create this description. The position description has been reviewed for accuracy and Dice believes it to correctly reflect the job opportunity.
