Site Reliability Engineer (SRE) with Java Experience

  • San Francisco, CA
  • Posted 5 hours ago | Updated 5 hours ago

Overview

On Site
Hybrid
Depends on Experience
Accepts corp to corp applications
Contract - W2
Contract - 12 Month(s)

Skills

Amazon Web Services
Cloud Computing
Collaboration
Continuous Delivery
Google Cloud Platform
Microsoft Azure
Splunk
Scripting
Shell
Performance Tuning
Kubernetes
Jenkins
Java
Docker
Python
Terraform
Spring Framework

Job Details

California local candidates are highly preferred.

Key Responsibilities:

  • Design, build, and maintain scalable, resilient, and highly available production systems.
  • Develop automation solutions using Java, Python, or other scripting languages to reduce manual operations.
  • Implement and manage monitoring, logging, and alerting frameworks for proactive issue detection.
  • Collaborate with development teams to ensure reliability, performance, and scalability in application design.
  • Troubleshoot production issues, perform root cause analysis, and implement long-term fixes.
  • Define and measure SLIs, SLOs, and SLAs to track and meet system reliability goals.
  • Optimize system performance and carry out capacity planning across distributed environments.
  • Support CI/CD pipelines and ensure smooth deployments with minimal downtime.
  • Champion best practices for incident management, post-mortem analysis, and continuous improvement.
  • Work with cloud-native platforms (AWS, Azure, or Google Cloud Platform) to build resilient infrastructure.

Required Skills and Qualifications:

  • Bachelor's or Master's degree in Computer Science, Engineering, or a related field.
  • Strong hands-on coding experience in Java (Spring Boot, microservices).
  • Proficiency in scripting languages (Python, Shell, or Go preferred).
  • Solid understanding of cloud platforms (AWS, Google Cloud Platform, or Azure).
  • Expertise in Kubernetes, Docker, and container orchestration.
  • Experience with CI/CD tools (Jenkins, GitHub Actions, GitLab CI, or similar).
  • Strong knowledge of monitoring and observability tools (Prometheus, Grafana, Splunk, ELK, Datadog, New Relic).
  • Experience with infrastructure as code (IaC) tools such as Terraform, Ansible, or CloudFormation.
  • Solid understanding of networking, distributed systems, and performance optimization.
  • Strong problem-solving, analytical, and troubleshooting skills.
  • Excellent communication and collaboration skills to work with cross-functional teams.

Employers have access to artificial intelligence language tools (“AI”) that help generate and enhance job descriptions and AI may have been used to create this description. The position description has been reviewed for accuracy and Dice believes it to correctly reflect the job opportunity.