Site Reliability Engineer

Overview

On Site

Depends on Experience

Contract - W2

Skills

DevOps

Amazon Web Services

Authentication

Google Cloud

Microservices

Microsoft Azure

JIRA

Kubernetes

Docker

Dashboard

Continuous Integration

Google Cloud Platform

Orchestration

Python

Scalability

Scripting

ServiceNow

Splunk

Terraform

Virtual Machines

Continuous Monitoring

Cloud Computing

Environment Management

Grafana

IT Service Management

Artificial Intelligence

Job Details

Responsibilities:

Extensive experience with IT infrastructure, cloud platforms (AWS, Azure, Google Cloud Platform), and modern DevOps/SRE methodologies.
Hands-on expertise with monitoring and observability tools: Grafana, Prometheus, and Splunk.
Familiarity with ITSM and operational tools such as ServiceNow and OpsRamp.
Experience with project and incident tracking tools like JIRA.
Proficiency in scripting and automation using Python, Bash, Terraform, and Ansible.
Strong understanding of CI/CD pipelines, containerization (Docker), and container orchestration (Kubernetes).
Performs environment management, automated server provisioning, and pipeline configuration (VMs).
Delivers software to improve the availability, scalability, latency, and efficiency of Client services.
Creates, manages, and uses dashboards for continuous monitoring and health checks of applications and the underlying infrastructure, and uses the monitoring feedback to improve the quality of services in nonproduction environments.
Contributes to future improvement of software delivery processes and operations, e.g., cloud enablement and the use of microservices with containerization.
Integrate Dynatrace with CI/CD pipelines, alerting tools, ITSM systems, and incident automation frameworks.
Tune alert thresholds, baselines, and AI-driven anomaly detection to reduce noise and improve actionable insights.
Deep understanding of login authentication mechanisms using Ping, ForgeRock, and SiteMinder technologies (session management and cookie management).
Build correlation mechanisms and dashboards for end-to-end visibility of requests from external to internal applications.
Evangelize SRE evolution within IT operations and promote a culture of engineering excellence and best practices.
Define best practices and principles for SRE, including incident management, monitoring, alerting, and automation.
Collaborate with development teams on resiliency to ensure that services and applications are designed with operational reliability in mind.

