Senior SRE Engineer

Overview

On Site
DOE
Contract - W2

Skills

Scalability
IaaS
System Monitoring
Operational Excellence
Splunk
Incident Management
KPI
Collaboration
Reliability Engineering
IT Infrastructure
Amazon Web Services
Google Cloud Platform
Google Cloud
Software Architecture
Grafana
ServiceNow
Dashboard
Management
Leadership
Communication
Problem Solving
Conflict Resolution
Microsoft Azure
ITIL
Cloud Computing

Job Details


Job Description:
We are seeking a highly skilled and experienced Senior Site Reliability Engineer (SRE) to lead our SRE team in ensuring the reliability, scalability, and performance of our production systems. The ideal candidate will have a strong background in cloud infrastructure, automation, and system monitoring, with excellent leadership and communication skills to collaborate across teams and foster a culture of operational excellence.


Responsibilities:
Design and develop enterprise-grade APIs and configuration solutions.
Contribute to enterprise and application architecture design.
Lead observability initiatives including monitoring, alerting, and incident response.
Build and maintain dashboards and alerting systems using Grafana, Prometheus, Splunk, etc.
Create and maintain detailed runbooks for operational procedures and incident handling.
Define and monitor SLAs, SLOs, and KPIs for critical services.
Collaborate with architecture, development, and security teams to ensure system reliability.
Evaluate and adopt new technologies to improve system performance and maintainability.


Requirements:
Strong background in IT infrastructure, cloud platforms (AWS, Azure, Google Cloud Platform), and SRE practices.
Experience in enterprise and application architecture.
Proven experience in building APIs and backend services.
Hands-on experience with tools:
  Monitoring & Observability: Grafana, Prometheus, Splunk
  ITSM & Operations: ServiceNow, OpsRamp
  Project & Incident Tracking: JIRA
Experience in building alerts, dashboards, and operational runbooks.
Experience managing distributed systems and large-scale production environments.
Strong leadership, communication, and problem-solving skills.
Ability to quickly learn and adapt to new technologies and environments.


Preferred, but not required:
Exposure to OpenShift and Azure cloud platforms.
Certifications: SRE Foundation, ITIL, or relevant cloud certifications.

Education:
Bachelor's Degree
