Site Reliability Engineer (SRE)

Overview

Remote
Depends on Experience
Contract - W2

Skills

Amazon Web Services
Bash
Computer Science
Chaos Engineering
Cloud Computing
Continuous Integration
Docker
Google Cloud Platform
Linux
High Availability
Incident Management
Git
Computer Networking
GitLab
Microservices
Reliability Engineering
Collaboration
Dragon NaturallySpeaking

Job Details

Position: Site Reliability Engineer (SRE)
Experience: 9+ years


About the Role

We are looking for a highly skilled Site Reliability Engineer (SRE) to join our team. The ideal candidate will bridge the gap between development and operations, ensuring our systems are scalable, reliable, and secure. You will be responsible for designing, automating, and monitoring critical infrastructure, and for improving the performance of our services.


Key Responsibilities

  • Build, maintain, and scale cloud infrastructure (AWS/Azure/Google Cloud Platform) with high availability and resilience.

  • Implement automation and Infrastructure-as-Code (IaC) using tools like Terraform, Ansible, or CloudFormation.

  • Monitor system performance, availability, and reliability using Prometheus, Grafana, ELK, Splunk, or Datadog.

  • Develop CI/CD pipelines (Jenkins, GitHub Actions, Azure DevOps, GitLab CI).

  • Manage incident response, on-call rotations, root cause analysis (RCA), and postmortems.

  • Optimize system reliability, latency, and scalability across distributed systems.

  • Ensure security, compliance, and disaster recovery strategies are in place.

  • Collaborate with DevOps, development, and QA teams to ensure efficient release cycles.

  • Drive the definition and implementation of SLOs, SLIs, and SLAs to measure and improve service health.

  • Troubleshoot production issues across services and infrastructure.


Required Skills & Qualifications

  • Bachelor's degree in Computer Science, Engineering, or equivalent experience.

  • 9+ years of experience in SRE, DevOps, or Cloud Infrastructure roles.

  • Strong expertise in Linux/Unix administration and in scripting or programming (Python, Bash, or Go).

  • Hands-on experience with Kubernetes, Docker, and microservices architectures.

  • Proficiency in cloud platforms (AWS, Azure, or Google Cloud Platform).

  • Experience with observability and monitoring tools (Prometheus, Grafana, ELK, New Relic, Datadog).

  • Familiarity with networking, DNS, load balancers, and CDN technologies.

  • Strong understanding of CI/CD pipelines and Git-based workflows.

  • Experience with incident management, chaos engineering, and resilience testing.
