Senior Site Reliability Engineer

Hybrid in Orlando, FL, US • Posted 11 hours ago • Updated 11 hours ago
Full Time
No Travel Required
On-site
Depends on Experience
Fitment

Job Details

Skills

  • Amazon Web Services
  • Reliability Engineering
  • Kubernetes
  • Grafana
  • Disaster Recovery
  • Performance Tuning
  • Incident Management
  • Cloud Computing
  • Nobl9
  • Root Cause Analysis
  • CSI
  • Akamai

Summary

Senior Site Reliability Engineer (SRE)

Orlando, FL (Hybrid)

Full Time 

Overview

We are looking for an experienced Site Reliability Engineer (8–10 years) to enhance system reliability, observability, and resilience for large-scale, business-critical applications. The role focuses on incident response, automation, performance optimization, and proactive reliability engineering.


Key Responsibilities

Reliability & Incident Management

  • Improve system reliability, availability, and performance through proactive risk identification.
  • Manage production incidents, perform RCA, and eliminate recurring issues.
  • Drive reduction in MTTD and MTTR.
  • Define and monitor SLIs, SLOs, and error budgets (Nobl9).
  • Build reliability dashboards, scorecards, and governance processes.

Observability & Monitoring

  • Enhance observability using OpenTelemetry, Grafana Cloud, AppDynamics, and Splunk.
  • Improve alert quality, reduce noise, and ensure actionable monitoring.
  • Manage metrics, events, logs, and traces (MELT) and apply AI-driven anomaly detection.

Cloud, Kubernetes & Infrastructure

  • Review AWS architectures against the AWS Well-Architected Framework.
  • Ensure high availability, disaster recovery, and multi-AZ/multi-region resilience.
  • Assess Kubernetes clusters (autoscaling, probes, resource tuning, network policies).
  • Identify and eliminate single points of failure across systems.

Performance & Resilience Engineering

  • Analyze latency, scalability, and dependency bottlenecks.
  • Validate system capacity for peak and future load.
  • Review CDN (Akamai), caching, routing, and failover strategies.
  • Support performance testing with engineering teams.

Chaos Engineering & Reliability Testing

  • Design and execute chaos experiments (Gremlin / Harness).
  • Simulate failures across infrastructure, network, and application layers.
  • Establish resilience baselines and drive improvements.

Automation & Self-Healing

  • Automate operational tasks and reduce manual intervention.
  • Build self-healing systems for recovery, scaling, and remediation.
  • Continuously reduce operational toil.

Required Skills

  • 8–10 years in Site Reliability Engineering.
  • Strong experience in incident management, RCA, and CSI tools.
  • SLO/SLI management using Nobl9.
  • Observability tools: OpenTelemetry, Grafana Cloud, AppDynamics, Splunk.
  • Chaos engineering tools: Gremlin or Harness.
  • Kubernetes, AWS (Well-Architected Framework).
  • Experience with Akamai CDN and large-scale distributed systems.

Dice Id: 10354711 • Position Id: 9008122
Contact the job poster

Hari Thota

Recruiter @ SRI Tech Solutions