Site Reliability Engineer

Irving, TX, US • Posted 1 day ago • Updated 1 day ago

Contract Corp To Corp

Contract W2

No Travel Required

On-site

Depends on Experience

Job Details

Skills

Analytical Skill
Reliability Engineering
Risk Management
Project Management
Recruiting
Regulatory Compliance
Preventive Maintenance
Onboarding
Orchestration
Organizational Skills
Management
Mentorship
Microsoft Azure
Information Systems
Investments
GitHub
HIPAA
Health Care
Performance Management
Microsoft Windows Administration
IT Management
Java
Kubernetes
Docker
Dynatrace
Continuous Delivery
Continuous Improvement
Continuous Integration
Computer Science
Configuration Management
Conflict Resolution
Dashboard
Leadership
Disaster Recovery
Engineering Support
Evaluation
FOCUS
Incident Management
Bash
Budget
Capacity Management
Cloud Computing
Collaboration
Communication
Linux
Multitasking
Offshoring
ProVision
Attention To Detail
Problem Solving
Production Support
Ansible
Python
ROOT
Reporting
Scalability
Scripting
Software Engineering
System Administration
Systems Architecture

Summary

Irving, TX

Work schedule (days & times) - M-F 8 AM to 5 PM PST

Site Reliability Engineer Position Description
The Site Reliability Engineer designs, enhances, and operates highly reliable, scalable, and observable production systems in an Azure-based environment. This role blends software engineering with systems administration to build resilient infrastructure, automate operations, and improve system performance. The engineer applies strong engineering principles to operational challenges with a focus on reliability, automation, observability, and continuous improvement.
Core responsibilities include engineering-led incident response, implementing permanent corrective actions, reducing operational toil, and proactively preventing failures. The role contributes to code fixes, owns Dynatrace-based observability, and delivers custom reliability and operational reporting to improve system health and availability. Participation in a scheduled on-call rotation is required.

Minimum Requirements

4-year degree in Computer Science, Information Systems, or Engineering, or relevant experience.
7+ years of site reliability experience.
8+ years of overall experience.

Key Responsibilities

Design, implement, and maintain monitoring solutions to ensure system health and performance.
Develop and manage CI/CD pipelines using GitHub Actions.
Deploy, manage, and troubleshoot containerized applications using Docker and Kubernetes.
Support and optimize Java-based applications in production environments.
Collaborate with development teams to improve system reliability and reduce operational toil.
Implement best practices for incident response, capacity planning, and disaster recovery.
Provision and manage infrastructure using Azure cloud services.
Improve system observability using tools such as Dynatrace (preferred).
Perform Linux and Windows system administration, including patching, configuration, and troubleshooting.
Automate operational tasks using Ansible, Python, Bash, or similar tools.

Advanced SRE Leadership Responsibilities

Provide technical leadership for SRE practices across multiple services or platforms.
Define and evolve reliability standards, operational best practices, and incident response frameworks.
Influence system architecture and design decisions to ensure scalability, resilience, and operability.
Serve as a subject matter expert for reliability, availability, and production risk management.
Act as the lead escalation point for complex and business-critical production incidents.
Lead high-severity incident response, coordinating across engineering, platform, and security teams.
Drive blameless post-incident reviews and ensure corrective actions are prioritized and completed.
Improve on-call processes, escalation models, and incident response effectiveness.
Own the strategy and implementation of Dynatrace-based observability, including dashboards and alerting standards.
Establish and monitor reliability signals (availability, latency, error rates) across critical systems.
Identify reliability risks and lead mitigation initiatives before customer impact occurs.
Define and maintain leadership-level reliability and operational reporting.
Use production data to drive prioritization of reliability investments and operational improvements.
Communicate reliability posture, risks, and recommendations to senior engineering leadership.
Mentor and guide senior and mid-level SREs and production support engineers.
Support hiring, onboarding, and technical evaluation of SRE talent.
Collaborate with squad members to define iteration plans and commitments.
Ensure compliance with HIPAA and other security regulations.

Critical Skills

Strong experience with monitoring and observability tools (Dynatrace experience is a plus).
Hands-on experience with GitHub Actions for CI/CD automation.
Proficiency in Docker and Kubernetes for containerization and orchestration.
Familiarity with Azure cloud services.
Experience with Ansible.
Demonstrated experience in automation of infrastructure and operational processes using scripting or configuration management tools.
Experience supporting Java applications in production.
Solid understanding of Linux and Windows system administration.
Knowledge of SRE principles (SLIs, SLOs, error budgets).

Additional Skills

Experience working with onsite and offshore teams.
Strong communication skills (written and verbal).
Strong organizational skills, attention to detail, and ability to multitask.
Experience in healthcare software or compliance solutions is a plus.
Strong analytical and problem-solving skills with the ability to identify root causes and propose effective solutions.

Employers have access to artificial intelligence language tools (“AI”) that help generate and enhance job descriptions and AI may have been used to create this description. The position description has been reviewed for accuracy and Dice believes it to correctly reflect the job opportunity.

Dice Id: 10311293
Position Id: 26-00024
