Job Summary:
We are seeking three SRE / DevOps Engineers to improve the reliability, observability, and operational readiness of business-critical platforms and services supporting the Helix program.
Although titled SRE / DevOps, these roles are heavily operations-oriented and require strong experience in production support, incident response, and Splunk-based monitoring.
Key Responsibilities
• Lead complex initiatives to improve the reliability, availability, and operational readiness of business-critical platforms and services.
• Own and support production operations, including implementation support, system health monitoring, and proactive issue identification.
• Play a key role in Incident Management, including triage, coordination, root cause analysis, and driving post-incident remediation.
• Support and participate in Business Continuity Planning activities, including failover readiness, disaster recovery testing, and recovery validation.
• Design, implement, and maintain monitoring, alerting, and observability solutions, with a strong emphasis on Splunk-based logging and dashboards.
• Automate operational workflows to reduce manual effort and shorten mean time to detect (MTTD) and mean time to recover (MTTR).
• Partner with application, platform, and security teams to ensure services are built and deployed with operational excellence and reliability in mind.
• Define and enforce SRE and DevOps standards, including SLIs/SLOs, alert hygiene, runbooks, and on-call best practices.
• Lead and participate in post-incident reviews, ensuring root causes are addressed and preventive actions are implemented.
• Mentor engineers on reliability engineering, incident response, and operational best practices.
• Continuously evaluate and improve system performance, resiliency, and operational tooling across the platform lifecycle.
Additional Role Context
• These roles are more operations-heavy than an engineering-focused SRE title may suggest.
• Strong Splunk experience is required, including dashboard creation, query development, log investigation, and trace-based troubleshooting across connected systems.
• The team needs people who can navigate issues across integrated systems involved in the Helix VM process, including front-end and infrastructure-connected services.
• The role supports a high-volume change environment, including multiple CR implementations in a single evening and operational coordination across workstreams.
• Candidates should be informed up front that the role may require after-hours deployments, night support, and possible weekend work tied to CRs and future BCP events. CR activity typically begins around 9 PM ET.
• Dallas is the preferred location for these operations resources to support onboarding and collaboration with the existing local team, though strong candidates outside Dallas may still be considered.
• Engagement is expected through the end of the year, based on project demand.