Site Reliability Engineer (SRE)

Remote • Posted 1 hour ago • Updated 1 hour ago
Contract W2
Contract Independent
No Travel Required
Remote
$70 - $80/hr

Job Details

Skills

  • Continuous Delivery
  • Capacity Management
  • Cloud Computing
  • Amazon Web Services
  • Budget
  • Chaos Engineering

Summary

Job Overview

We are seeking a highly skilled Site Reliability Engineer (SRE) to join our engineering team and help ensure the reliability, scalability, and performance of our production systems. In this role, you will work closely with software engineers, cloud architects, and DevOps teams to build automated infrastructure solutions, improve system observability, and maintain highly available distributed systems.

The ideal candidate has a strong background in cloud infrastructure, distributed systems, automation, and monitoring tools, along with experience managing large-scale production environments.

This position is fully remote within the United States and offers the opportunity to work on modern cloud-native platforms and highly scalable applications.


Key Responsibilities

System Reliability & Performance

  • Ensure the availability, reliability, and performance of mission-critical production systems.

  • Monitor infrastructure, applications, and services using advanced observability and monitoring tools.

  • Analyze system performance metrics and implement improvements to reduce latency and downtime.

  • Perform capacity planning to support growing system demands.

Automation & Infrastructure

  • Develop and maintain automation tools to reduce manual operational tasks.

  • Build and manage Infrastructure as Code (IaC) using tools such as Terraform or CloudFormation.

  • Automate deployment processes and operational workflows.

Incident Management

  • Participate in an on-call rotation to respond to system incidents and outages.

  • Conduct root cause analysis (RCA) and implement long-term solutions to prevent recurring issues.

  • Develop incident response playbooks and improve operational processes.

Monitoring & Observability

  • Implement and maintain monitoring systems such as Prometheus, Grafana, ELK Stack, or Datadog.

  • Create dashboards and alerts to ensure proactive monitoring of production environments.

  • Improve system visibility through logging, metrics, and tracing.

Collaboration & Engineering Support

  • Work closely with development teams to improve application reliability and deployment practices.

  • Assist engineering teams with production deployments and troubleshooting.

  • Advocate for SRE best practices, including error budgets, SLIs, and SLOs.


Required Qualifications

  • Bachelor’s degree in Computer Science, Software Engineering, or a related technical field (or equivalent practical experience).

  • 5+ years of experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering.

  • Strong experience with Linux-based systems administration.

  • Hands-on experience with cloud platforms such as AWS, Azure, or Google Cloud Platform (GCP).

  • Strong scripting or programming experience in Python, Go, or Bash.

  • Experience managing containerized environments using Docker and Kubernetes.


Core Technical Skills

  • Kubernetes & container orchestration

  • Cloud platforms (AWS / Azure / Google Cloud Platform)

  • Infrastructure as Code (Terraform, CloudFormation)

  • Monitoring tools (Prometheus, Grafana, Datadog, New Relic)

  • Logging tools (ELK Stack, Splunk)

  • CI/CD pipelines (Jenkins, GitHub Actions, GitLab CI)

  • Distributed systems architecture


Preferred / Nice-to-Have Skills

  • Experience with microservices architecture

  • Knowledge of service mesh technologies (Istio, Linkerd)

  • Experience implementing chaos engineering practices

  • Familiarity with security best practices in cloud infrastructure

  • Experience with high-availability and disaster recovery architectures


Work Environment

  • Fully remote work environment across the United States

  • Collaborative engineering culture with cross-functional teams

  • Opportunity to work on large-scale distributed cloud systems

Employers have access to artificial intelligence language tools (“AI”) that help generate and enhance job descriptions and AI may have been used to create this description. The position description has been reviewed for accuracy and Dice believes it to correctly reflect the job opportunity.
  • Dice Id: 91172806
  • Position Id: 8914465