SRE Architect

Overview

Remote
Depends on Experience
Full Time
No Travel Required

Skills

SRE
UIM
Prometheus
Grafana
Datadog
New Relic
Splunk
AppDynamics

Job Details

1. Technical Expertise

Deep understanding of SRE principles, the SRE operating model, and DevOps methodologies.

Experience designing highly available, scalable, and resilient distributed systems.

Proficient in architectural design (microservices, cloud-native, and event-driven architectures).

Skilled in cloud platforms: Azure and Google Cloud Platform.

Strong knowledge of observability tools: UIM, Prometheus, Grafana, Datadog, New Relic, Splunk, AppDynamics.

2. Framework Design & Governance

Define and validate SLOs, SLIs, SLAs, error budgets, and availability targets.

Design runbooks, escalation policies, and chaos testing frameworks.

Create reusable templates for observability, alerting, and logging.

Ensure compliance and audit readiness.

3. Communication & Cross-Functional Leadership

Collaborate with architects, designers, and platform and infrastructure teams.

Document frameworks and lead adoption across teams.

Review designs and validate reliability criteria.

Roles & Responsibilities

1. Framework & Standardization

Define and maintain the SRE operating model, framework, and onboarding guide.

Create templates and reference architectures for observability, alerting, and runbooks.

Standardize definitions of availability, reliability, latency, and performance.

2. Architectural Integration

Participate in application architecture reviews to validate SRE compliance.

Recommend design patterns for fault tolerance, failover, auto-scaling, and disaster recovery (DR).

Define observability-by-design principles.

3. Governance, Audit & Optimization

Establish and lead SRE councils or review boards.

Define SRE maturity models, scorecards, and compliance checks.

Perform SRE audits across product portfolios.

Guide teams on capacity modeling, load distribution, and cost-efficiency strategies.

Collaborate with platform teams on resource reservations and right-sizing.

4. Tool Rationalization & Strategy

Evaluate and recommend standard SRE toolchains for monitoring, logging, and tracing.

Own the integration strategy across observability platforms.

5. Training, Leadership & Evangelism

Conduct SRE bootcamps for application and infrastructure teams.

Champion a blameless culture and continuous improvement mindset.

Drive error budget policies and reliability trade-off discussions.

Mentor product teams on SRE integration strategies.

Influence architectural decisions with SRE perspectives.


About Stanley David and Associates