Site Reliability Engineer - Onsite role in Wilmington, DE / Philadelphia, PA

Overview

On Site
Depends on Experience
Contract - W2
Contract - 12 Month(s)
Able to Provide Sponsorship

Skills

Unix
S3
AWS
Splunk
Observability
ELK

Job Details

We are seeking a Site Reliability Engineer who thinks systematically about reliability, can translate business requirements into technical implementations, and thrives on making complex systems more robust.

This individual will support on-premises applications with Unix and shell scripting, manage infrastructure using S3 and Terraform, and ensure system reliability across cloud environments.

In this role, you will:

  • Work alongside developers and business stakeholders to automate acceptance criteria
  • Maintain high reliability and availability for software applications
  • Automate mundane tasks to avoid human error
  • Define SLIs (service level indicators) and SLOs (service level objectives) in collaboration with product owners
  • Support on-premises applications with Unix and shell scripting, manage infrastructure using S3 and Terraform, and ensure system reliability across cloud environments
  • Write incident root cause analyses: identify the core reason behind each issue and prevent it from recurring
  • Document procedures, best practices and troubleshooting FAQs.
  • Debug the system and fix production-related issues.
  • Escalate and follow up on permanent fixes for development-related issues.
  • Handle complex operational tasks and recommend process and technology changes.
  • Provide global support, including troubleshooting production-related issues and performing checkouts.

Required Qualifications:

5+ years of Technology Infrastructure Engineering and Solutions experience, or equivalent demonstrated through one or a combination of the following: work experience, training, military experience, education

5+ years of Site Reliability Engineering experience or related experience

Desired Qualifications:

  • Strong understanding of REST APIs
  • Strong working knowledge of troubleshooting tools such as Splunk, AppDynamics, and Elastic APM
  • Strong experience with API management tools such as Apigee
  • Working knowledge of databases such as MongoDB and Oracle
  • Strong foundation in reliability engineering principles and distributed systems behavior
  • Experience defining and implementing SLOs/SLIs and using them to drive system improvements
  • Demonstrated ability to design and implement observability solutions that provide actionable insights while minimizing alert fatigue
  • Understanding of modern observability practices and experience implementing and maintaining monitoring solutions such as Prometheus, Grafana, Splunk, New Relic, CloudWatch, and ELK in the cloud
  • Strong incident response skills with experience leading incident retrospectives and driving improvements
  • Excellent problem-solving abilities and experience debugging distributed systems
  • Track record of successfully automating operations and reducing toil
  • Strong communication skills with the ability to explain complex technical concepts to varied audiences
  • Ability to work both independently and collaboratively in an energetic environment.

About Unisoft Technology Inc