Site Reliability Engineer (SRE) Full Time Position

Job Details

Job Title: Site Reliability Engineer (SRE)

Location: Austin, TX (5 days Onsite)

Duration: Permanent / Full-time

They are mainly looking for candidates with real-time production support experience: people who can describe the challenges they faced and how they overcame issues, rather than candidates who have only worked as DevOps Engineers. Other areas of interest include alerts, monitoring tools, and Python scripting.

Job Summary

Seasoned Site Reliability Engineer (SRE) with 5+ years of experience supporting complex, large-scale distributed systems. Highly skilled in managing production failures, conducting root cause analysis, and driving effective remediation. Strong communicator with expertise in alerting, monitoring, and release management, complemented by automation proficiency and the ability to learn quickly.

This role involves providing 24/7 support as part of the SRE team, ensuring the reliability and performance of mission-critical Java, .NET, and batch applications deployed across Google Cloud Platform, PCF, and on-premises environments.

Technical Skills:

Expertise in large-scale production systems and technologies, such as load balancing, monitoring, distributed systems, microservices, and configuration management.

Solid hands-on experience in troubleshooting and fixing application failures, application performance degradation, code issues, cloud platform issues, batch failures, infrastructure failures, database failures, and network failures.

Hands-on experience performing production deployments using CI/CD, and exposure to deployment strategies.

Experience troubleshooting Linux/Unix systems.

Monitor application, service, and batch availability.

Act quickly on application alerts (performance, availability) and batch job failures.

Perform the required analysis (Code/Log) and escalate to the Engineering team as required.

Initiate and drive Techlines during outages, major incidents, and batch abends, and ensure service restoration as quickly as possible.

Effectively handle incident, problem, release, and change management.

Own and deliver user stories assigned as part of the sprint.
