Site Reliability Engineering (SRE) Lead Java & AWS

Overview

On Site
Hybrid
$120,000 - $140,000
Full Time
100% Travel

Skills

Continuous Integration
Amazon RDS
Amazon Web Services
Application Development
Amazon EC2
GitLab
Health Care
Incident Management
Kubernetes
Continuous Delivery
Collaboration
Regulatory Compliance
Reliability Engineering
Remote Desktop Services
Performance Tuning
Virtual Private Cloud
Problem Solving
Management

Job Details

Job Title: Site Reliability Engineering (SRE) Lead Java & AWS
Location: Phoenix, AZ (Hybrid/Onsite as required)
Type: Full-time

Job Summary

Valuespectrum is seeking a highly skilled SRE Lead with a strong Java development background and expertise in AWS cloud infrastructure. The ideal candidate will lead a team of SREs to design, implement, and maintain reliable, scalable, and secure systems that power critical business applications. This role blends hands-on coding, automation, and DevOps practices with leadership responsibilities to drive system reliability and performance across the enterprise.

  1. Key Responsibilities
     - Lead and mentor the Site Reliability Engineering team to ensure uptime, reliability, and scalability of production systems.
     - Collaborate with Java engineering teams to build resilient microservices and APIs with reliability in mind.
     - Drive automation for deployments, monitoring, logging, and incident response.
     - Architect and manage AWS infrastructure (EC2, EKS, RDS, Lambda, S3, VPC, CloudFormation/Terraform).
     - Establish and enforce SLOs, SLIs, and SLAs to track and improve service reliability.
     - Lead incident management processes, root cause analysis, and post-mortem reviews.
     - Implement CI/CD pipelines and work closely with development teams for continuous delivery.
     - Champion observability practices using monitoring, logging, and alerting tools (CloudWatch, Prometheus, Grafana, ELK, Datadog, etc.).
     - Ensure cost optimization, security, and compliance in AWS environments.
     - Stay ahead of industry best practices and introduce modern SRE/DevOps methodologies.
  2. Required Skills & Experience
     - 8+ years of IT experience, including at least 4 years in Site Reliability Engineering or DevOps roles.
     - Strong background in Java/J2EE applications (coding, debugging, performance tuning).
     - Hands-on expertise with AWS cloud infrastructure and services (certifications preferred).
     - Proven experience with Kubernetes, Docker, and container orchestration.
     - Solid knowledge of Terraform, Ansible, or CloudFormation for infrastructure as code (IaC).
     - Expertise in monitoring/observability (Prometheus, Grafana, ELK, Splunk, Datadog).
     - Strong experience in CI/CD pipelines (Jenkins, GitLab, GitHub Actions, or similar).
     - Excellent leadership, communication, and problem-solving skills.
     Preferred Skills
     - Experience with financial/healthcare systems or high-availability platforms.
     - Knowledge of resiliency testing, chaos engineering, and disaster recovery.
     - AWS Certified Solutions Architect or DevOps Engineer certification.
     - Familiarity with service mesh technologies (Istio, Linkerd).

About Value Spectrum Technologies LLC