Site Reliability Engineer (SRE)

Overview

Hybrid

Depends on Experience

Contract - W2

Contract - 18 Months

Skills

Site Reliability Engineer

SRE

Datadog

Kubernetes

AWS

EKS

On-call

Job Details

Job Title: Site Reliability Engineer
Location: Westlake, TX / Merrimack, NH (Hybrid)
Duration: Long-Term Contract

Shift Details:

On-call: 10 am to 8 pm EST (twice per week, one may fall on a weekend)
Non on-call: Monday to Friday, 9:00 am to 5:00 pm EST

Required Skills

* Datadog
* Kubernetes
* AWS (EKS preferred), Azure (AKS)
* On-call experience running incidents
* Development background with Ansible, Python, Node.js, JavaScript, Jenkins (Groovy scripting)

Role Overview

As a Site Reliability Engineer (SRE), you will be responsible for building, supporting, and scaling reliable and resilient distributed systems. This role combines software engineering and systems engineering to ensure high availability, automation, observability, and performance.

Key Responsibilities

* Design, build, and support highly distributed, multi-tiered systems at scale
* Drive automation using scripting and Infrastructure as Code (IaC) tools
* Implement CI/CD pipelines and DevOps best practices
* Manage Kubernetes clusters and containerized applications
* Apply observability practices including monitoring, alerting, and logging
* Troubleshoot incidents, perform root cause analysis, and ensure system resiliency
* Collaborate with cross-functional engineering teams

Qualifications

* Bachelor's degree in Computer Science, Engineering, or a related field (Master's preferred)
* 8+ years of experience deploying and supporting distributed systems
* 2+ years of hands-on cloud (AWS preferred) development and migration experience
* 2 to 4 years of experience in software development with Python, Node.js, or Java
* Strong Kubernetes administration and operations experience
* Expertise with monitoring/observability tools (Datadog, Prometheus, Grafana, Splunk, ELK, OpenTelemetry, etc.)
* Experience with Infrastructure as Code (Terraform, IAM, ARM, Chef, etc.)
* Strong troubleshooting, incident response, and communication skills

Nice to Have

* Chaos testing and resiliency engineering experience
* Experience supporting large-scale enterprise platforms
* Exposure to multiple cloud platforms (AWS & Azure)

Employers have access to artificial intelligence language tools (“AI”) that help generate and enhance job descriptions and AI may have been used to create this description. The position description has been reviewed for accuracy and Dice believes it to correctly reflect the job opportunity.
