Site Reliability Engineer - Observability & Resilience

Orlando, FL, US • Posted 4 days ago • Updated 3 hours ago

Full Time

On-site

HTC Global Services

Job Details

Skills

Real-time
Tier 1
Operational Efficiency
Service Level
Continuous Improvement
Reliability Analysis
Roadmaps
Problem Management
MEAN Stack
Service Level Management
Dashboard
Budget
Regulatory Compliance
Optimization
Operational Excellence
HR Management System
Disaster Recovery
Backup
Startups
Database
High Availability
Splunk
Artificial Intelligence
Mapping
Scalability
Performance Engineering
Stress Testing
Routing
Caching
Failover
Testing
Network
Workflow
Recovery
Reliability Engineering
Root Cause Analysis
Management
AppDynamics
Performance Monitoring
Grafana
Cloud Computing
CHAOS
Akamai
Amazon Web Services
Kubernetes
Incident Management
Collaboration
Emerging Technologies
Insurance
Professional Development
Innovation
HTC
Recruiting

Summary

Job Title: Senior Site Reliability Engineer (SRE)

Overview / Summary

We are seeking a Site Reliability Engineer (SRE) with 8-10 years of experience to drive reliability, observability, and resilience improvements across critical systems. This is a high-impact, front-line operations role focused on real-time incident response, proactive prevention, continuous automation, and reliability engineering for Tier-1 business-critical applications.

Key Responsibilities

Drive automation initiatives to improve system performance and operational efficiency.
Improve application reliability and availability by proactively identifying and mitigating risks.
Analyze production incidents and root cause analyses (RCAs) to eliminate recurring issues and reduce outages.
Define and manage Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Error Budgets using Nobl9.
Conduct reliability assessments across applications, infrastructure, Kubernetes, databases, networks, caching platforms, and cloud environments.
Drive observability improvements using OpenTelemetry, Grafana Cloud, AppDynamics, Splunk, and monitoring best practices.
Perform performance and scalability reviews to support current and future demand.
Lead chaos engineering exercises using Gremlin or Harness Chaos Engineering.
Review cloud architectures against AWS Well-Architected Framework standards and drive remediation of reliability gaps.
Automate operational tasks and implement self-healing solutions.
Identify and eliminate single points of failure (SPOFs) and strengthen disaster recovery and failover capabilities.
Collaborate with Development, Infrastructure, Performance Engineering, and Operations teams to improve system resilience.
Establish reliability governance, dashboards, runbooks, and continuous improvement processes.

Reliability Assessment & Engineering

Conduct application reliability assessments using established reliability frameworks.
Review historical incidents, Sev-1/Sev-2 RCAs, and recurring failure patterns.
Identify reliability debt and drive remediation initiatives.
Evaluate application readiness for SRE engagement.
Perform end-to-end reliability reviews across application, infrastructure, network, and platform layers.
Define reliability roadmaps and track improvement initiatives.

Incident Management & RCA

Analyze incident trends using CSI or equivalent incident management platforms.
Participate in Major Incident Management and Problem Management processes.
Drive RCA reviews and corrective actions.
Track reliability improvement initiatives resulting from postmortems.
Reduce Mean Time to Detect (MTTD) and Mean Time to Recover (MTTR).
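
For illustration only, the two metrics above could be computed from incident timestamps as in the minimal Python sketch below. The `Incident` record and its field names are hypothetical and are not taken from CSI or any other platform. MTTR is assumed here to run from fault start to service restoration; some teams measure it from detection instead.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Hypothetical incident record; field names are illustrative assumptions."""
    started: datetime   # when the fault began
    detected: datetime  # when an alert or a person noticed it
    resolved: datetime  # when service was restored


def _mean(deltas):
    """Average a sequence of timedeltas."""
    deltas = list(deltas)
    if not deltas:
        raise ValueError("at least one incident is required")
    return sum(deltas, timedelta()) / len(deltas)


def mttd(incidents):
    """Mean Time to Detect: average of (detected - started)."""
    return _mean(i.detected - i.started for i in incidents)


def mttr(incidents):
    """Mean Time to Recover: average of (resolved - started), by assumption."""
    return _mean(i.resolved - i.started for i in incidents)


# Made-up sample data for the sketch.
incidents = [
    Incident(datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 4), datetime(2024, 1, 1, 10, 30)),
    Incident(datetime(2024, 1, 2, 14, 0), datetime(2024, 1, 2, 14, 10), datetime(2024, 1, 2, 15, 0)),
]
print("MTTD:", mttd(incidents))
print("MTTR:", mttr(incidents))
```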

Service Level Management

Define and implement SLIs.
Establish SLOs and Error Budgets using Nobl9.
Partner with Product and Engineering teams to define business-focused reliability targets.
Build SLO dashboards and reliability scorecards.
Monitor error budget consumption and enforce governance policies.
Conduct reliability reviews based on SLO compliance.
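
As a rough illustration of the error budget arithmetic behind these responsibilities, the Python sketch below derives budget consumption for a request-based SLO. It is plain arithmetic and does not use the Nobl9 API. The function name, parameters, and sample numbers are assumptions made for the example.

```python
def error_budget(slo_target: float, total_events: int, bad_events: int) -> dict:
    """Summarize error budget consumption for a request-based SLO.

    slo_target is a fraction, for example 0.999 for "99.9% of requests succeed".
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events <= 0 or not 0 <= bad_events <= total_events:
        raise ValueError("event counts are inconsistent")

    sli = (total_events - bad_events) / total_events   # observed success ratio
    allowed_bad = (1.0 - slo_target) * total_events    # the error budget, in events
    return {
        "sli": sli,
        "allowed_bad_events": allowed_bad,
        "budget_consumed": bad_events / allowed_bad,   # 1.0 means fully spent
        "slo_met": sli >= slo_target,
    }


# Made-up numbers: a 99.9% target with 400 failures in one million requests.
report = error_budget(slo_target=0.999, total_events=1_000_000, bad_events=400)
print(f"SLI {report['sli']:.4%}, budget consumed {report['budget_consumed']:.0%}")
```

A governance policy could then key off `budget_consumed`, for instance pausing risky releases once it approaches 1.0.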

Cloud & Platform Reliability

Review cloud architectures against AWS Well-Architected Framework principles.
Conduct reliability, performance, cost optimization, security, and operational excellence assessments.
Identify High Risk Issues (HRIs) and drive remediation.
Validate high availability, disaster recovery, backup, and failover capabilities.
Ensure multi-AZ and multi-region deployment strategies are implemented where required.

Kubernetes & Infrastructure Reliability

Review Kubernetes cluster health and workload configurations.
Validate resource requests, limits, autoscaling, and resiliency patterns.
Assess readiness, liveness, and startup probes.
Review service mesh configurations, network policies, and traffic routing.
Validate database high availability, caching strategies, and scaling configurations.
Identify and eliminate single points of failure.

Observability & Monitoring

Design and improve enterprise observability strategies.
Implement OpenTelemetry-based telemetry collection.
Manage metrics, events, logs, and traces (MELT).
Integrate telemetry into Grafana Cloud, Splunk Observability, or equivalent platforms.
Utilize AI-driven observability capabilities for anomaly detection and root cause analysis.
Improve alert quality, reduce alert fatigue, and increase actionable monitoring coverage.
Ensure every alert has an owner, runbook, and customer impact justification.

Application Performance Engineering

Conduct dependency mapping and architecture reviews.
Analyze latency, throughput, and scalability bottlenecks.
Review timeout, retry, circuit breaker, and resilience patterns.
Collaborate with Performance Engineering teams on load and stress testing.
Validate system capacity against current and future traffic demands.
Review Akamai CDN configurations, traffic routing, caching, and failover strategies.
Ensure applications can sustain significant traffic spikes and peak loads.
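
To make the circuit breaker pattern named above concrete, here is a minimal Python sketch of the kind of logic such a review would examine. It is an illustrative assumption, not the implementation used by any team or library mentioned in this posting. The class name, thresholds, and injectable clock are all hypothetical.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    """Minimal breaker: opens after N consecutive failures, retries after a cool-down."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after      # cool-down in seconds
        self.clock = clock                  # injectable so tests can fake time
        self.failures = 0
        self.opened_at = None               # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down elapsed: half-open, so let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures in a row, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Wrapping a downstream call as `breaker.call(fetch_quote, policy_id)` (both names hypothetical) would fail fast while the dependency is unhealthy instead of stacking up timeouts.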

Chaos Engineering & Resilience Testing

Design and execute chaos engineering experiments using Gremlin or Harness Chaos Engineering.
Simulate infrastructure, network, application, and dependency failures.
Validate system behavior during failure scenarios.
Establish reliability score baselines and improvement goals.
Measure resilience against real-world production conditions.
Document findings and implement corrective improvements.

Automation & Self-Healing

Identify repetitive operational tasks suitable for automation.
Develop self-healing workflows for common infrastructure and application failures.
Automate alert remediation, scaling, recovery, and operational activities.
Reduce manual intervention and operational toil.
Improve platform efficiency through engineering-driven automation.

Required Qualifications

8-10 years of experience in Site Reliability Engineering.
Experience with CSI for incident and RCA tracking.
Experience with Nobl9 for SLO management.
Experience with AppDynamics for application performance monitoring.
Experience with OpenTelemetry and Grafana Cloud for telemetry and observability.
Experience with Gremlin or Harness Chaos Engineering.
Experience with Akamai CDN.
Knowledge of AWS Well-Architected Framework.
Experience with Kubernetes reliability, observability, incident management, automation, and resilience engineering.

What Makes HTC A Great Place To Build Your Future

HTC Global Services wants you to join our team. Come build new things with us and advance your career. At HTC Global, you'll collaborate with experts, work alongside clients, and be part of high-performing teams driving success together. You'll have long-term opportunities to grow your career and develop skills in the latest emerging technologies.

At HTC Global Services, our employees have access to a comprehensive benefits package. Benefits can include Group Health (Medical, Dental, and Vision), Paid Time Off, Paid Holidays, 401(k) matching, Group Life and Disability insurance, Professional Development opportunities, Wellness programs, and a variety of other perks.

Our success as a company is built on inclusion and diversity. HTC Global Services is committed to providing a workplace free from discrimination and harassment, where every employee is treated with dignity and respect. We celebrate differences and believe that diverse cultures, perspectives, and skills drive innovation and success. HTC is an Equal Opportunity Employer and a proud National Minority Supplier. We seek to empower each individual, fostering an environment where everyone feels valued, included, and respected.

#LI-ST1 #LI-Hybrid #Hiring

Dice Id: 10122753
Position Id: 243352

Company Info

About HTC Global Services

HTC Global Services, established in 1990 and headquartered in Troy, Michigan, is a leading global information technology and business process services company with operations across North America, Europe, Asia Pacific, the Middle East, and India. We leverage our expertise in legacy and emerging digital technologies to deliver transformative outcomes for our clients, which include Fortune 1000 companies.

Our new vision, "Reimagining a better-shared world," and our mission, "Bringing human expertise to tech for delivering purposeful solutions that amplify value," are at the heart of our transformation approach, powered by cloud, platform mindset, and engagement. Our motto, "Let's make digital change happen," is our commitment to empower our clients to succeed in this digital world. Our values (integrity, teamwork, the pursuit of excellence, commitment, customer-centricity, and thought leadership) define our character and behavior.

Mission:
Bring human expertise to tech in order to deliver purposeful solutions that amplify value.
