Site Reliability Architect (SRE): Unified Observability & AIOps
Role Summary
We are seeking a Senior SRE with strong expertise in Unified Observability, proactive detection, AIOps, and GenAI-driven operations to support complex, distributed financial services platforms. The role requires hands-on experience designing SLI/SLO-driven monitoring, dynamic thresholds, intelligent alerting, and AI/ML-based anomaly detection across multi-stream architectures.
Key Responsibilities
Observability & Reliability Engineering
- Design and implement unified observability dashboards across metrics, logs, traces, events, and topology
- Define and manage SLIs, SLOs, and error budgets aligned to business outcomes
- Build actionable dashboards for operations, engineering, and leadership
- Implement alerting strategies using static and dynamic thresholds
Proactive Detection & AIOps
- Leverage AI/ML/AIOps to detect anomalies, predict incidents, and reduce MTTR
- Transition monitoring from reactive alerts to proactive insights
- Implement noise reduction, alert correlation, and root cause analysis
- Apply baseline modeling, seasonality detection, and anomaly scoring
Distributed Systems & Dependency Analysis
- Monitor and troubleshoot multi-service architectures involving:
  - Microservices
  - Downstream APIs
  - Kafka / streaming platforms
  - Cloud infrastructure (Terraform, IaC)
- Identify whether issues originate from:
  - Upstream/downstream dependencies
  - Streaming platform
  - Infrastructure
  - Application code
Tooling & Platforms
- Deep hands-on experience with Dynatrace (mandatory)
- Experience with:
  - OpenTelemetry
  - Prometheus / Grafana
  - ELK / EFK
  - Cloud-native monitoring (AWS/Azure/Google Cloud Platform)
- Strong JSON-based telemetry manipulation and enrichment
GenAI & LLM Enablement
- Apply GenAI / LLMs for:
  - Incident summarization
  - Root cause explanation
  - Runbook recommendations
  - Auto-remediation suggestions
- Collaborate with platform teams to operationalize GenAI safely
Required Skills & Experience
- 15+ years in SRE / Production Engineering
- Strong Unified Observability background (not infra-only)
- Hands-on Dynatrace experience (metrics, traces, logs, Davis AI)
- SLI/SLO engineering experience in production systems
- Experience implementing dynamic thresholds and anomaly detection
- Knowledge of AI/ML concepts applied to Ops (AIOps)
- Distributed systems troubleshooting expertise
- Experience with Kafka or streaming data platforms
Differentiators (Highly Valued)
- Experience in financial services or regulated environments
- Proven reduction of alert noise and MTTR using AIOps
- GenAI / LLM integration into operations workflows
Interview Question Bank (Mapped to LPL Expectations)
- Dashboards, SLAs, and Reliability Targets
Purpose: Identify true SREs vs dashboard builders
- How do you design dashboards differently for engineers vs leadership?
- Explain how SLIs and SLOs differ from SLAs. Which do you operationalize?
- How do you map SLOs to alerting without creating noise? (See the burn-rate sketch below.)
- What KPIs would you track for a critical trading or advisor-facing platform?
Red Flag: Talks only about CPU, memory, uptime
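As one illustration of mapping SLOs to alerting without creating noise, a strong answer often describes multi-window burn-rate alerts. The sketch below is a simplified, hypothetical Python example (a single error ratio stands in for each short/long window pair); the thresholds echo the Google SRE Workbook pattern rather than any specific platform.

```python
# Illustrative only: burn-rate alerting against an availability SLO,
# in the spirit of the Google SRE Workbook; numbers are hypothetical.

SLO_TARGET = 0.999          # 99.9% availability over a 30-day window
ERROR_BUDGET = 1 - SLO_TARGET

# (window, burn-rate threshold, action): fast burns page, slow burns open a ticket.
POLICIES = [
    ("5m+1h", 14.4, "page"),    # roughly 2% of the monthly budget gone in an hour
    ("6h+3d", 1.0,  "ticket"),  # budget on pace to be fully spent by period end
]


def burn_rate(error_ratio: float) -> float:
    """How many times faster than 'exactly on budget' we are burning."""
    return error_ratio / ERROR_BUDGET


def evaluate(error_ratios: dict) -> list:
    """error_ratios: observed error ratio per policy window, e.g. {'5m+1h': 0.02}."""
    return [
        (window, action)
        for window, threshold, action in POLICIES
        if burn_rate(error_ratios.get(window, 0.0)) >= threshold
    ]


# A 2% error ratio against a 0.1% budget is a 20x burn: page immediately.
print(evaluate({"5m+1h": 0.02, "6h+3d": 0.0005}))  # [('5m+1h', 'page')]
```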
- Alerting Strategy & Threshold Design
Purpose: Assess signal-to-noise maturity
- How do you decide when to use static vs dynamic thresholds?
- Explain how you prevent alert storms during high traffic or seasonal spikes.
- What makes an alert actionable?
- How do you design alerts for early symptom detection?
Follow-up:
- What happens after an alert fires? Walk me through the lifecycle.
- Dynamic Thresholds & Anomaly Detection
Purpose: Validate AIOps fundamentals
- How do dynamic thresholds work under the hood?
- How do you account for baseline drift and seasonality?
- What risks do dynamic thresholds introduce?
- How would you tune sensitivity to avoid false positives?
Expected Concepts: baselines, ML models, adaptive learning, and time-series analysis (a minimal sketch follows)
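The sketch referenced above is a deliberately naive illustration of those concepts: a per-hour seasonal baseline with a z-score-style anomaly score, in Python on synthetic data. Production AIOps engines (Dynatrace Davis AI, for example) use far richer adaptive models; this only shows the shape of an answer.

```python
# Illustrative only: naive seasonal baseline + z-score-style anomaly scoring.
from collections import defaultdict
from statistics import mean, stdev


def build_hourly_baseline(history):
    """history: iterable of (hour_of_day, value) samples from past weeks."""
    buckets = defaultdict(list)
    for hour, value in history:
        buckets[hour].append(value)
    # Per-hour mean/stdev captures daily seasonality (e.g. market-open traffic).
    return {h: (mean(v), stdev(v) if len(v) > 1 else 0.0) for h, v in buckets.items()}


def anomaly_score(baseline, hour, value, min_sigma=1.0):
    """Deviation from the learned baseline for that hour, in sigma-like units."""
    mu, sigma = baseline.get(hour, (value, 0.0))
    return abs(value - mu) / max(sigma, min_sigma)


# Synthetic history: business-hours traffic is higher, with mild noise.
history = [(h % 24, 100 + 40 * (9 <= h % 24 <= 16) + h % 7) for h in range(24 * 14)]
baseline = build_hourly_baseline(history)
print(anomaly_score(baseline, hour=10, value=145))  # low: fits the daytime pattern
print(anomaly_score(baseline, hour=3, value=145))   # high: same value is abnormal at night
```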
- Multiplexing (Metrics, Signals, Streams)
Purpose: Test system observability depth
- What is multiplexing in observability?
- How do multiple telemetry signals strengthen diagnosis?
- Provide an example where one signal was misleading.
- How do you correlate metrics, traces, logs, and events?
- JSON Tooling & Proactive Detection
Purpose: Ensure hands-on operational telemetry skills (an illustrative sketch follows the questions)
- How have you used JSON-based event payloads to enrich observability?
- How do you normalize data across heterogeneous sources?
- How do structured logs improve proactive detection?
- How do you extract signals from high-volume telemetry?
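The sketch below (hypothetical field names and catalog) shows the kind of JSON normalization and enrichment these questions probe: mapping heterogeneous payloads onto one schema and attaching ownership metadata before the data reaches alerting or AIOps.

```python
# Illustrative only: normalizing heterogeneous JSON telemetry into one schema
# and enriching it with service metadata. All names below are hypothetical.
import json

SERVICE_CATALOG = {  # hypothetical CMDB/ownership lookup
    "orders-api": {"team": "payments", "tier": "critical"},
    "quote-stream": {"team": "market-data", "tier": "high"},
}

# Different sources name the same attributes differently.
FIELD_ALIASES = {
    "service": ["service", "service.name", "app", "component"],
    "latency_ms": ["latency_ms", "duration", "responseTimeMs"],
    "status": ["status", "http.status_code", "level"],
}


def first_present(event, aliases):
    for key in aliases:
        if key in event:
            return event[key]
    return None


def normalize(raw_json: str) -> dict:
    event = json.loads(raw_json)
    record = {field: first_present(event, aliases) for field, aliases in FIELD_ALIASES.items()}
    # Enrichment: attach ownership and tier so alerts route to the right team.
    record.update(SERVICE_CATALOG.get(record["service"], {"team": "unknown", "tier": "unknown"}))
    return record


print(normalize('{"app": "orders-api", "responseTimeMs": 850, "http.status_code": 503}'))
# -> {'service': 'orders-api', 'latency_ms': 850, 'status': 503, 'team': 'payments', 'tier': 'critical'}
```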
- Proactive vs Reactive Detection
Purpose: Directly aligned to LPL's concerns
- Give an example where you predicted an incident before customer impact.
- What indicators help you identify impending failures?
- How do you measure the success of proactive detection?
- Multi-Service Failure Diagnosis (Critical Question)
Purpose: Core differentiator at LPL
Scenario Question
A user-facing issue is reported. The architecture includes:
- Frontend
- Backend microservices
- Downstream APIs
- Kafka streams
- Terraform-managed infrastructure
Ask:
- How do you determine if the issue is:
  - Application-related?
  - Kafka or streaming lag?
  - Downstream API latency?
  - Infrastructure drift via Terraform?
Expected Approach: dependency mapping, golden signals, trace correlation, and change analysis (a minimal sketch follows)
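The sketch below is a toy, hypothetical triage heuristic over golden signals and recent change events. It is not how any specific platform works, but it mirrors the expected approach: localize the fault to streaming lag, a downstream dependency, application code, or an infrastructure change.

```python
# Illustrative only: a toy triage heuristic over golden signals and change events.
# Real diagnosis relies on dependency maps and trace correlation; all names and
# thresholds here are hypothetical.

signals = {  # latest golden signals per hop (hypothetical values)
    "frontend":       {"error_rate": 0.08, "p95_latency_ms": 2400},
    "orders-service": {"error_rate": 0.07, "p95_latency_ms": 2300},
    "downstream-api": {"error_rate": 0.01, "p95_latency_ms": 180},
    "kafka":          {"consumer_lag": 1_200_000},   # messages behind
}
recent_changes = [  # deployments / Terraform applies in the last hour
    {"target": "kafka", "type": "terraform-apply", "age_min": 35},
]


def suspects(signals, changes, window_min=60):
    ranked = []
    if signals["kafka"]["consumer_lag"] > 100_000:
        ranked.append(("kafka/streaming lag", "consumer lag far above baseline"))
    if signals["downstream-api"]["p95_latency_ms"] > 1000:
        ranked.append(("downstream API latency", "p95 breaches dependency SLO"))
    if signals["orders-service"]["error_rate"] > 0.05 and signals["downstream-api"]["error_rate"] < 0.02:
        ranked.append(("application code", "errors originate in the service, not its dependencies"))
    for change in changes:
        if change["age_min"] <= window_min:
            ranked.append((f'infrastructure change on {change["target"]}', change["type"]))
    return ranked


for cause, evidence in suspects(signals, recent_changes):
    print(f"suspect: {cause:40s} evidence: {evidence}")
```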
- Dynatrace (Mandatory)
Purpose: Address explicit gap in feedback
- What Dynatrace features have you used most?
- How does Davis AI determine root cause?
- How do you implement service-level baselining in Dynatrace?
- How do you reduce alert noise using Dynatrace?
Red Flag: "I've mostly used dashboards."
- AI/ML & AIOps Fundamentals
Purpose: Ensure non-theoretical knowledge
- What ML techniques are commonly used in AIOps?
- How do supervised vs unsupervised models differ in Ops?
- Where does AI fail in observability?
- How do you validate AI-based decisions?
- GenAI & LLM Use Cases for SRE
Purpose: Explicit LPL requirement
- Where do you see GenAI adding value in SRE?
- Have you used LLMs for incident response?
- How would you integrate GenAI without introducing risk?
- What data would you restrict from LLM exposure?
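For the last question above, one common pattern is redacting sensitive fields before any incident context reaches an LLM. The sketch below is a minimal, hypothetical example; the field names and patterns are placeholders, and a real deployment would also enforce this centrally at the platform or gateway layer.

```python
# Illustrative only: scrubbing sensitive fields from an incident payload before
# any of it is placed in an LLM prompt. Keys and regexes are hypothetical.
import copy
import re

DENYLIST_KEYS = {"account_number", "ssn", "customer_name", "api_key", "auth_token"}
PII_PATTERNS = [re.compile(r"\b\d{3}-\d{2}-\d{4}\b")]  # e.g. US SSN-shaped strings


def redact(payload: dict) -> dict:
    """Return a copy of the incident payload that is safe to include in a prompt."""
    clean = copy.deepcopy(payload)
    for key in list(clean):
        if key in DENYLIST_KEYS:
            clean[key] = "[REDACTED]"
        elif isinstance(clean[key], str):
            for pattern in PII_PATTERNS:
                clean[key] = pattern.sub("[REDACTED]", clean[key])
        elif isinstance(clean[key], dict):
            clean[key] = redact(clean[key])
    return clean


incident = {
    "service": "orders-api",
    "error": "timeout calling downstream-api",
    "customer_name": "Jane Doe",
    "context": {"note": "caller mentioned SSN 123-45-6789 in the ticket", "api_key": "abc123"},
}
print(redact(incident))
# Sensitive values are masked; operational context (service, error) is preserved.
```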