Site Reliability Engineer (SRE) Leader/Architect - Full Time

Overview

Remote
Depends on Experience
Full Time

Skills

Site Reliability Engineer

Job Details

Job Title: Site Reliability Engineer (SRE) Leader

Location: 100% Remote

Duration: Full Time

About the Role - We are seeking an SRE Leader to architect and drive Site Reliability Engineering strategy across financial services customers.

Key Responsibilities

Strategic Leadership

  • Define and execute the SRE strategy aligned with business goals and engineering priorities.
  • Establish and evangelize SRE principles, best practices, and culture across engineering and product teams.
  • Drive adoption of reliability-focused design patterns, automation, and observability across the organization.
  • Work with multiple stakeholders (developers, architects, operations, and business teams) to define and adopt reliability engineering practices.

Build-Side Reliability Initiatives

  • Embed SRE practices early in the software development lifecycle to ensure reliability is designed into systems from the start.
  • Partner with development teams to implement shift-left reliability testing, including automated resilience and chaos tests in CI/CD pipelines.
  • Define golden paths for developers with pre-built reliability patterns, templates, and infrastructure-as-code modules.
  • Drive build-time observability by integrating telemetry, logging, and tracing into application code during development.
  • Champion performance benchmarking and capacity modeling during build phases to prevent scalability issues post-deployment.
  • Collaborate with architects to enforce reliability-driven design reviews before major releases.

Operational Excellence

  • Own the reliability, availability, and performance of critical services and infrastructure.
  • Lead incident management, root cause analysis, and postmortem processes with a focus on continuous improvement.
  • Develop and monitor SLAs, SLOs, and SLIs to ensure service health and customer satisfaction.

Team Building & Mentorship

  • Build, mentor, and scale a world-class SRE team with a focus on diversity, inclusion, and growth.
  • Foster a culture of ownership, accountability, and innovation within the team.
  • Collaborate with engineering, product, and business stakeholders to align reliability goals with product roadmaps.

Tooling & Automation

  • Drive automation of operational tasks, deployments, and incident response.
  • Lead efforts in observability, monitoring, alerting, and capacity planning.
  • Evaluate and implement modern SRE tools and platforms to improve efficiency and reduce toil.

Governance & Compliance

  • Ensure compliance with security, privacy, and regulatory requirements in all reliability practices.
  • Establish governance frameworks for change management, risk mitigation, and service continuity.

Qualifications

  • 15-20 years of experience.
  • 8+ years in leadership roles managing large-scale SRE programs.
  • Deep understanding of cloud-native architectures (AWS, Azure, Google Cloud Platform), microservices, and distributed systems.
  • Proficiency with Application Performance Monitoring (APM) tools (New Relic or Dynatrace) for monitoring, logging, and tracing, and with Splunk for log monitoring.
  • Expertise in observability tools (e.g., Prometheus, Grafana, Datadog), CI/CD pipelines, and infrastructure as code (Terraform, Ansible).
  • Strong experience with incident response, chaos engineering, and reliability testing.
  • Proven ability to influence cross-functional teams and drive organizational change.

About Visionary Innovative Technology Solutions