Senior Site Reliability Engineer

Coppell, TX, US • Posted 18 hours ago • Updated 18 hours ago

Full Time

No Travel Required

On-site

Depends on Experience

Job Details

Skills

ARM
Amazon Web Services
Ansible
Backup
Bash
Budget
Capacity Management
Cloud Computing
Collaboration
Communication
Computer Networking
Continuous Delivery
Continuous Integration
DevOps
Disaster Recovery
Docker
Failover
Google Cloud
Google Cloud Platform
Grafana
HIPAA
Hardening
High Availability
IaaS
Incident Management
Kubernetes
Linux
Mentorship
Microsoft Azure
Microsoft Windows Administration
Operational Excellence
Performance Tuning
Product Engineering
Productivity
Python
RBAC
RPO
Regulatory Compliance
Reliability Engineering
Root Cause Analysis
SaaS
Scalability
Scripting
Service Level
SOC 2
Splunk
Terraform
Windows PowerShell

Summary

The Senior Site Reliability Engineer role combines deep operational expertise with hands-on engineering ability. Most of your time (~70%) will go to owning the reliability, availability, scalability, and operational excellence of the cloud infrastructure and SaaS platforms that power our business. The remaining ~30% puts you directly in the platform engineering flow: building automation, improving deployment pipelines, and driving reliability initiatives from conception through production.
You will write and review automation code, contribute to architecture and deployment discussions, and collaborate closely with product engineering teams so that operational and reliability decisions are made correctly the first time.

Key Responsibilities

Reliability Engineering & Operations (~40% of role)

Own day-to-day monitoring, alerting, operational health, and on-call support for mission-critical SaaS platforms and cloud infrastructure.

Lead major incident response activities including escalation coordination, root cause analysis, and postmortem reviews.

Design and maintain high-availability, failover, backup, and disaster recovery procedures; validate RTO/RPO targets regularly.

Investigate and resolve production incidents end-to-end across infrastructure, platform, and application layers.

Automation & Platform Engineering (~30% of role)

Design, implement, and maintain Infrastructure as Code (IaC), deployment automation, and CI/CD pipeline improvements.

Develop tooling and automation to reduce operational toil and improve engineering productivity.

Partner with development teams to improve deployment safety, release reliability, and operational scalability.

Drive standardization of cloud infrastructure, operational engineering practices, and deployment governance.

Observability & Performance Optimization (~15% of role)

Build and maintain monitoring, logging, tracing, and alerting capabilities across distributed systems.

Establish service-level indicators (SLIs), service-level objectives (SLOs), and error budget policies.

Identify and remediate performance bottlenecks, scaling issues, and infrastructure inefficiencies.

Analyze operational telemetry and trends to improve reliability and capacity planning.

Security, Compliance & Architecture (~15% of role)

Implement operational security best practices including RBAC, least privilege access, and infrastructure hardening.

Ensure compliance with SOC 2, HIPAA, GDPR, and organizational security standards.

Participate in architecture reviews and operational readiness assessments for new services and platforms.

Mentor junior engineers on reliability engineering, cloud operations, automation, and incident management best practices.

Required Qualifications

7+ years of experience in Site Reliability Engineering, DevOps, Cloud Infrastructure, or Production Operations roles.

Strong experience operating workloads in cloud environments such as Microsoft Azure, AWS, or Google Cloud.

Hands-on experience with Kubernetes, Docker, CI/CD pipelines, and Infrastructure as Code tools.

Strong scripting and automation skills using Python, Bash, PowerShell, Go, or similar languages.

Experience with observability and monitoring platforms such as Datadog, Grafana, Prometheus, or Splunk.

Strong understanding of networking, Linux/Windows administration, distributed systems, and cloud-native architectures.

Experience with incident response, production troubleshooting, and operational governance.

Strong communication skills and ability to collaborate across engineering and business teams.

Preferred Qualifications

Experience supporting multi-tenant SaaS environments.

Experience with Terraform, Bicep, ARM templates, or Ansible.

Familiarity with GitOps and modern deployment strategies such as canary or blue/green deployments.

Experience working within regulated or compliance-driven environments.

Relevant cloud or Kubernetes certifications.

Employers have access to artificial intelligence language tools (“AI”) that help generate and enhance job descriptions, and AI may have been used to create this description. The position description has been reviewed for accuracy, and Dice believes it correctly reflects the job opportunity.

Dice Id: 90692381
Position Id: 8963660