Site Reliability Engineer

Irving, TX, US • Posted 1 day ago • Updated 1 day ago
Contract Corp To Corp
Contract W2
No Travel Required
On-site
Depends on Experience

Job Details

Skills

  • Analytical Skill
  • Reliability Engineering
  • Risk Management
  • Project Management
  • Recruiting
  • Regulatory Compliance
  • Preventive Maintenance
  • Onboarding
  • Orchestration
  • Organizational Skills
  • Management
  • Mentorship
  • Microsoft Azure
  • Information Systems
  • Investments
  • GitHub
  • HIPAA
  • Health Care
  • Performance Management
  • Microsoft Windows Administration
  • IT Management
  • Java
  • Kubernetes
  • Docker
  • Dynatrace
  • Continuous Delivery
  • Continuous Improvement
  • Continuous Integration
  • Computer Science
  • Configuration Management
  • Conflict Resolution
  • Dashboard
  • Leadership
  • Disaster Recovery
  • Engineering Support
  • Evaluation
  • FOCUS
  • Incident Management
  • Bash
  • Budget
  • Capacity Management
  • Cloud Computing
  • Collaboration
  • Communication
  • Linux
  • Multitasking
  • Offshoring
  • ProVision
  • Attention To Detail
  • Problem Solving
  • Production Support
  • Ansible
  • Python
  • ROOT
  • Reporting
  • Scalability
  • Scripting
  • Software Engineering
  • System Administration
  • Systems Architecture

Summary

Irving, TX

Work schedule (days & times) - M-F 8 AM to 5 PM PST

Site Reliability Engineer Position Description
The Site Reliability Engineer designs, enhances, and operates highly reliable, scalable, and observable production systems in an Azure-based environment. This role blends software engineering with systems administration to build resilient infrastructure, automate operations, and improve system performance. The engineer applies strong engineering principles to operational challenges with a focus on reliability, automation, observability, and continuous improvement.
Core responsibilities include engineering-led incident response, implementing permanent corrective actions, reducing operational toil, and proactively preventing failures. The role contributes to code fixes, owns Dynatrace-based observability, and delivers custom reliability and operational reporting to improve system health and availability. Participation in a scheduled on-call rotation is required.

Minimum Requirements

  • 4-year degree in Computer Science, Information Systems, or Engineering, or relevant experience.
  • 7+ years of site reliability experience.
  • 8+ years of overall experience.

Key Responsibilities

  • Design, implement, and maintain monitoring solutions to ensure system health and performance.
  • Develop and manage CI/CD pipelines using GitHub Actions.
  • Deploy, manage, and troubleshoot containerized applications using Docker and Kubernetes.
  • Support and optimize Java-based applications in production environments.
  • Collaborate with development teams to improve system reliability and reduce operational toil.
  • Implement best practices for incident response, capacity planning, and disaster recovery.
  • Provision and manage infrastructure using Azure cloud services.
  • Improve system observability using tools such as Dynatrace (preferred).
  • Perform Linux and Windows system administration, including patching, configuration, and troubleshooting.
  • Automate operational tasks using Ansible, Python, Bash, or similar tools.

Advanced SRE Leadership Responsibilities

  • Provide technical leadership for SRE practices across multiple services or platforms.
  • Define and evolve reliability standards, operational best practices, and incident response frameworks.
  • Influence system architecture and design decisions to ensure scalability, resilience, and operability.
  • Serve as a subject matter expert for reliability, availability, and production risk management.
  • Act as the lead escalation point for complex and business-critical production incidents.
  • Lead high-severity incident response, coordinating across engineering, platform, and security teams.
  • Drive blameless post-incident reviews and ensure corrective actions are prioritized and completed.
  • Improve on-call processes, escalation models, and incident response effectiveness.
  • Own the strategy and implementation of Dynatrace-based observability, including dashboards and alerting standards.
  • Establish and monitor reliability signals (availability, latency, error rates) across critical systems.
  • Identify reliability risks and lead mitigation initiatives before customer impact occurs.
  • Define and maintain leadership-level reliability and operational reporting.
  • Use production data to drive prioritization of reliability investments and operational improvements.
  • Communicate reliability posture, risks, and recommendations to senior engineering leadership.
  • Mentor and guide senior and mid-level SREs and production support engineers.
  • Support hiring, onboarding, and technical evaluation of SRE talent.
  • Collaborate with squad members to define iteration plans and commitments.
  • Ensure compliance with HIPAA and other security regulations.

Critical Skills

  • Strong experience with monitoring and observability tools (Dynatrace experience is a plus).
  • Hands-on experience with GitHub Actions for CI/CD automation.
  • Proficiency in Kubernetes and Docker for container orchestration.
  • Familiarity with Azure cloud services.
  • Experience with Ansible.
  • Demonstrated experience in automation of infrastructure and operational processes using scripting or configuration management tools.
  • Experience supporting Java applications in production.
  • Solid understanding of Linux and Windows system administration.
  • Knowledge of SRE principles (SLIs, SLOs, error budgets).

Additional Skills

  • Experience working with onsite and offshore teams.
  • Strong communication skills (written and verbal).
  • Strong organizational skills, attention to detail, and ability to multitask.
  • Experience in healthcare software or compliance solutions is a plus.
  • Strong analytical and problem-solving skills with the ability to identify root causes and propose effective solutions.
Employers have access to artificial intelligence language tools (“AI”) that help generate and enhance job descriptions and AI may have been used to create this description. The position description has been reviewed for accuracy and Dice believes it to correctly reflect the job opportunity.
  • Dice Id: 10311293
  • Position Id: 26-00024