Site Reliability Engineer (SRE) Production Support

Overview

On Site
Depends on Experience
Contract - Independent
Contract - W2
Contract - 12 Month(s)

Skills

Production Support
Amazon Web Services
Cloud Computing
Google Cloud Platform
Reliability Engineering

Job Details

Job Title: Site Reliability Engineer (SRE) Production Support
Location: Bellevue, WA

Role Overview:
We are looking for a highly motivated Site Reliability Engineer (SRE) with a strong background in production support.
The ideal candidate will possess a proactive SRE mindset, excellent communication skills, and deep technical expertise to ensure system reliability, performance, and scalability across complex infrastructure.

Key Responsibilities:
Drive proactive monitoring and issue detection using observability tools to minimize Mean Time to Detect (MTTD) and Mean Time to Resolve (MTTR).
Serve as Incident Commander during critical incidents: lead triage calls, engage with cross-functional teams (Engineering, Product, Tier 2 SREs), and ensure timely resolution in alignment with SLAs/OLAs.
Analyze system health by correlating data from dashboards, logs, and telemetry tools; recommend preventive and corrective actions.
Participate in continuous improvement efforts including self-healing, automation, and root cause analysis.
Maintain readiness and flexibility to support a 24x7 production environment, including weekend or after-hours coverage as needed.

Required Skills & Expertise:
SRE & Observability:
Strong SRE mindset with experience in monitoring, alerting, and observability best practices.
Hands-on experience with tools such as Splunk, Splunk APM, Splunk O11y, AppDynamics, Grafana, RedMetrics, and ThousandEyes.
Exposure to UEM (User Experience Monitoring) and synthetic monitoring tools.

Technical Proficiency:
Solid understanding of infrastructure components: VMs, Load Balancers, Firewalls, API Gateways, Databases, Linux/Unix systems.
Experience with containerization and orchestration tools: Docker, Kubernetes.
Cloud platform knowledge: AWS, Google Cloud Platform, PCF.
Familiarity with tools such as NMON and Wireshark for system and network performance diagnostics.

ITSM & Automation:
Working knowledge of ServiceNow, especially AIOps, automated playbooks, and self-healing mechanisms.

Soft Skills:
Exceptional communication and coordination skills, able to engage effectively with stakeholders at Director/Sr. Director level and above.
Strong analytical and troubleshooting abilities, capable of handling high-pressure scenarios with clarity and decisiveness.
