Overview
On Site
110,000 - 120,000
Full Time
No Travel Required
Unable to Provide Sponsorship
Skills
Incident Management
Reliability Engineering
Performance Engineering
Java
Kubernetes
Microsoft Azure
DevOps
Prolog
Haskell
Objective Caml
Linux
Scripting
Unix
Bash
Python
VMware
Virtualization
Apache Kafka
Splunk
Docker
Terraform
Azure Certification
Kubernetes Certification
Job Details
Role: Senior Site Reliability Engineer (SRE) – Java / Kubernetes / Azure
Location: Phoenix, AZ (Day 1 Onsite)
Job Type: Full-Time
Pay Rate Range: 105k - 125k/year
Key Responsibilities:
- Provide senior-level SRE support, ensuring system reliability, availability, and operational excellence across all environments.
- Develop and maintain services and automation scripts using Java as the primary programming language.
- Build, deploy, and optimize workloads running on Kubernetes clusters (including multi-cluster and federated deployments).
- Manage and enhance cloud infrastructure leveraging Azure services and best practices.
- Work with Linux/Unix systems and develop automation using Bash shell scripting.
- Build automation and tooling using Python or Go.
- Design, implement, and maintain CI/CD pipelines using GitLab CI/CD and Jenkins.
- Support application streaming, event processing, and analytics using Kafka Stream Generator, ksqlDB, and Spark Streams.
- Work with service mesh technologies including Istio and understand Anthos Service Mesh.
- Utilize VMware and other virtualization platforms for environment provisioning.
- Provide robust incident support, root-cause analysis, and production issue resolution.
- Implement eBPF-based observability and performance troubleshooting where applicable.
- Develop and enhance monitoring and alerting systems using Splunk, Prometheus, Datadog, and Kiali.
- Configure and manage load balancing with Nginx Controller and Seesaw.
- Use Terraform for infrastructure-as-code and Docker for containerization.
- Manage Kubernetes storage using Portworx.
- Automate repetitive operational tasks and contribute to platform stability and efficiency.
- Provide support across all US time zones, including rotational shifts, weekends, and occasional 24/7 escalations.
Required Skills & Qualifications:
- Extensive experience in incident response, troubleshooting, performance engineering, and service reliability.
- Ability to automate manual operational tasks.
- Strong understanding of monitoring, alerting, and observability practices.
- Java (Proficient) – Must be hands-on in building, supporting, and optimizing Java-based systems and microservices.
- Kubernetes (Hands-on) – Deployment, autoscaling, federation, ingress, storage, service mesh, and cluster operations.
- Azure (Highly Proficient) – Strong experience across Azure compute, networking, storage, DevOps, and security features.
- Proficiency in functional and logic programming languages: Haskell, OCaml, Prolog.
- Knowledge of Linux/Unix internals and Bash scripting.
- Strong experience with Python or Go.
- VMware and virtualization technologies.
- Kafka ecosystem tools: Kafka Stream Generator, ksqlDB, Spark Streams.
- Experience with Istio/Anthos Service Mesh.
- Familiarity with eBPF for low-level observability.
- Monitoring tools: Splunk, Prometheus, Datadog, Kiali.
- Load balancing with Nginx Controller and Seesaw.
- Docker and Terraform expertise.
- Experience working with Portworx for Kubernetes storage.
Certification Required:
- Azure
- Kubernetes