Site Reliability Engineer (SRE)

Remote • Posted 1 day ago • Updated 1 day ago

Full Time

No Travel Required

Remote

Depends on Experience

Job Details

Skills

Capacity Management
DevOps
Disaster Recovery
Microsoft Azure
Performance Tuning
Management
Continuous Integration
Continuous Delivery
SaaS
Scalability

Summary

100% REMOTE

Position Overview

The Site Reliability Engineer plays a key role in maintaining the stability, scalability, and overall health of cloud-based SaaS systems. This position blends software engineering practices with deep operational expertise to create resilient services, streamline processes through automation, and strengthen the platform's reliability posture.

This role is ideal for someone who enjoys solving complex production challenges, reducing operational friction through automation, and taking ownership of system performance and uptime. The SRE will collaborate closely with Engineering, QA, and Product teams to enhance deployment quality, observability, and operational maturity across the organization.

Core Responsibilities

Platform Reliability & Operations

Define and track service-level objectives and indicators to measure system health.
Build and refine observability, alerting, and monitoring frameworks.
Participate in incident response and continuously improve response processes.
Lead root cause analysis efforts and implement long-term corrective actions.
Strengthen system redundancy, resilience, and disaster recovery capabilities.

Automation & Infrastructure Engineering

Develop automation to eliminate repetitive operational tasks and reduce toil.
Enhance CI/CD pipelines, deployment workflows, and release reliability.
Partner with development teams to ensure new features meet production-readiness standards.
Manage and optimize Azure-based infrastructure for performance and cost efficiency.
Apply infrastructure-as-code and configuration management best practices.

Performance & Scalability

Conduct performance benchmarking and capacity planning.
Identify and remediate performance bottlenecks across applications and databases.
Improve horizontal scaling strategies and fault-tolerant architecture patterns.

Governance, Security & Documentation

Ensure infrastructure and deployment processes follow secure-by-design principles.
Support compliance and audit activities related to availability and security.
Maintain clear documentation for architecture, runbooks, and operational procedures.
Collaborate with security teams on access controls and monitoring enhancements.

Qualifications & Experience

Success Indicators

Increased system uptime and reliability.
Reduced MTTR and fewer recurring production issues.
Higher deployment success rates with fewer rollbacks.
Demonstrable reduction in manual operational work through automation.

Required Skills

Strong foundation in distributed systems, cloud operations, and production engineering.
Hands-on experience with Azure in enterprise SaaS environments.
Proficiency in scripting/automation (PowerShell, Python, Bash, etc.).
Experience with CI/CD pipelines and infrastructure-as-code tools.
Familiarity with observability platforms such as Application Insights, Datadog, Prometheus, or Grafana.
Background in incident management and root cause analysis.
Solid understanding of networking, databases, and performance tuning.
Strong communication and cross-functional collaboration abilities.

Experience & Education

5+ years in SRE, DevOps, or similar production engineering roles.
Experience supporting large-scale SaaS applications in production.
Bachelor's degree in Computer Science, Engineering, or a related field.

Preferred Certifications

Azure Administrator or Azure Solutions Architect certifications.
ITIL or formal training in incident management.

Employers have access to artificial intelligence language tools (“AI”) that help generate and enhance job descriptions and AI may have been used to create this description. The position description has been reviewed for accuracy and Dice believes it to correctly reflect the job opportunity.

Dice Id: 10382761
Position Id: 8897885
