Site Reliability Architect (SRE): Unified Observability & AIOps
Role Summary
We are seeking a Senior SRE with strong expertise in Unified Observability, proactive detection, AIOps, and GenAI-driven operations to support complex, distributed financial services platforms. The role requires hands-on experience designing SLI/SLO-driven monitoring, dynamic thresholds, intelligent alerting, and AI/ML-based anomaly detection across multi-stream architectures.
Key Responsibilities
Observability & Reliability Engineering
- Design and implement unified observability dashboards across metrics, logs, traces, events, and topology
- Define and manage SLIs, SLOs, and error budgets aligned to business outcomes
- Build actionable dashboards for operations, engineering, and leadership
- Implement alerting strategies using static and dynamic thresholds
Proactive Detection & AIOps
- Leverage AI/ML/AIOps to detect anomalies, predict incidents, and reduce MTTR
- Transition monitoring from reactive alerts to proactive insights
- Implement noise reduction, alert correlation, and root cause analysis
- Apply baseline modeling, seasonality detection, and anomaly scoring
Distributed Systems & Dependency Analysis
- Monitor and troubleshoot multi-service architectures involving:
  - Microservices
  - Downstream APIs
  - Kafka / streaming platforms
  - Cloud infrastructure (Terraform, IaC)
- Identify whether issues originate from:
  - Upstream/downstream dependencies
  - Streaming platform
  - Infrastructure
  - Application code
Tooling & Platforms
- Deep hands-on experience with Dynatrace (mandatory)
- Experience with:
  - OpenTelemetry
  - Prometheus / Grafana
  - ELK / EFK
  - Cloud-native monitoring (AWS/Azure/Google Cloud Platform)
- Strong JSON-based telemetry manipulation and enrichment
GenAI & LLM Enablement
- Apply GenAI / LLMs for:
  - Incident summarization
  - Root cause explanation
  - Runbook recommendations
  - Auto-remediation suggestions
- Collaborate with platform teams to operationalize GenAI safely
Required Skills & Experience
- 15+ years in SRE / Production Engineering
- Strong Unified Observability background (not infra-only)
- Hands-on Dynatrace experience (metrics, traces, logs, Davis AI)
- SLI/SLO engineering experience in production systems
- Experience implementing dynamic thresholds and anomaly detection
- Knowledge of AI/ML concepts applied to Ops (AIOps)
- Distributed systems troubleshooting expertise
- Experience with Kafka or streaming data platforms
Differentiators (Highly Valued)
- Experience in financial services or regulated environments
- Proven reduction of alert noise and MTTR using AIOps
- GenAI / LLM integration into operations workflows
Interview Question Bank (Mapped to LPL Expectations)
- Dashboards, SLAs, and Reliability Targets
Purpose: Identify true SREs vs dashboard builders
- How do you design dashboards differently for engineers vs leadership?
- Explain how SLIs and SLOs differ from SLAs. Which do you operationalize?
- How do you map SLOs to alerting without creating noise? (See the burn-rate sketch below.)
- What KPIs would you track for a critical trading or advisor-facing platform?
Red Flag: Talks only about CPU, memory, uptime
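As one illustration of mapping SLOs to alerting without creating noise, a strong answer often describes multi-window burn-rate alerts. The sketch below is a simplified, hypothetical Python example (a single error ratio stands in for each short/long window pair); the thresholds echo the Google SRE Workbook pattern rather than any specific platform.

```python
# Illustrative only: burn-rate alerting against an availability SLO,
# in the spirit of the Google SRE Workbook; numbers are hypothetical.

SLO_TARGET = 0.999          # 99.9% availability over a 30-day window
ERROR_BUDGET = 1 - SLO_TARGET

# (window, burn-rate threshold, action): fast burns page, slow burns open a ticket.
POLICIES = [
    ("5m+1h", 14.4, "page"),    # roughly 2% of the monthly budget gone in an hour
    ("6h+3d", 1.0,  "ticket"),  # budget on pace to be fully spent by period end
]


def burn_rate(error_ratio: float) -> float:
    """How many times faster than 'exactly on budget' we are burning."""
    return error_ratio / ERROR_BUDGET


def evaluate(error_ratios: dict) -> list:
    """error_ratios: observed error ratio per policy window, e.g. {'5m+1h': 0.02}."""
    return [
        (window, action)
        for window, threshold, action in POLICIES
        if burn_rate(error_ratios.get(window, 0.0)) >= threshold
    ]


# A 2% error ratio against a 0.1% budget is a 20x burn: page immediately.
print(evaluate({"5m+1h": 0.02, "6h+3d": 0.0005}))  # [('5m+1h', 'page')]
```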
- Alerting Strategy & Threshold Design
Purpose: Assess signal-to-noise maturity
- How do you decide when to use static vs dynamic thresholds?
- Explain how you prevent alert storms during high traffic or seasonal spikes.
- What makes an alert actionable?
- How do you design alerts for early symptom detection?
Follow-up:
- What happens after an alert fires? Walk me through the lifecycle.
- Dynamic Thresholds & Anomaly Detection
Purpose: Validate AIOps fundamentals
- How do dynamic thresholds work under the hood?
- How do you account for baseline drift and seasonality?
- What risks do dynamic thresholds introduce?
- How would you tune sensitivity to avoid false positives?
Expected Concepts: baselines, ML models, adaptive learning, and time-series analysis (a minimal sketch follows)
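The sketch referenced above is a deliberately naive illustration of those concepts: a per-hour seasonal baseline with a z-score-style anomaly score, in Python on synthetic data. Production AIOps engines (Dynatrace Davis AI, for example) use far richer adaptive models; this only shows the shape of an answer.

```python
# Illustrative only: naive seasonal baseline + z-score-style anomaly scoring.
from collections import defaultdict
from statistics import mean, stdev


def build_hourly_baseline(history):
    """history: iterable of (hour_of_day, value) samples from past weeks."""
    buckets = defaultdict(list)
    for hour, value in history:
        buckets[hour].append(value)
    # Per-hour mean/stdev captures daily seasonality (e.g. market-open traffic).
    return {h: (mean(v), stdev(v) if len(v) > 1 else 0.0) for h, v in buckets.items()}


def anomaly_score(baseline, hour, value, min_sigma=1.0):
    """Deviation from the learned baseline for that hour, in sigma-like units."""
    mu, sigma = baseline.get(hour, (value, 0.0))
    return abs(value - mu) / max(sigma, min_sigma)


# Synthetic history: business-hours traffic is higher, with mild noise.
history = [(h % 24, 100 + 40 * (9 <= h % 24 <= 16) + h % 7) for h in range(24 * 14)]
baseline = build_hourly_baseline(history)
print(anomaly_score(baseline, hour=10, value=145))  # low: fits the daytime pattern
print(anomaly_score(baseline, hour=3, value=145))   # high: same value is abnormal at night
```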
- Multiplexing (Metrics, Signals, Streams)
Purpose: Test system observability depth
- What is multiplexing in observability?
- How do multiple telemetry signals strengthen diagnosis?
- Provide an example where one signal was misleading.
- How do you correlate metrics, traces, logs, and events?
- JSON Tooling & Proactive Detection
Purpose: Ensure hands-on operational telemetry skills (an illustrative sketch follows the questions)
- How have you used JSON-based event payloads to enrich observability?
- How do you normalize data across heterogeneous sources?
- How do structured logs improve proactive detection?
- How do you extract signals from high-volume telemetry?
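The sketch below (hypothetical field names and catalog) shows the kind of JSON normalization and enrichment these questions probe: mapping heterogeneous payloads onto one schema and attaching ownership metadata before the data reaches alerting or AIOps.

```python
# Illustrative only: normalizing heterogeneous JSON telemetry into one schema
# and enriching it with service metadata. All names below are hypothetical.
import json

SERVICE_CATALOG = {  # hypothetical CMDB/ownership lookup
    "orders-api": {"team": "payments", "tier": "critical"},
    "quote-stream": {"team": "market-data", "tier": "high"},
}

# Different sources name the same attributes differently.
FIELD_ALIASES = {
    "service": ["service", "service.name", "app", "component"],
    "latency_ms": ["latency_ms", "duration", "responseTimeMs"],
    "status": ["status", "http.status_code", "level"],
}


def first_present(event, aliases):
    for key in aliases:
        if key in event:
            return event[key]
    return None


def normalize(raw_json: str) -> dict:
    event = json.loads(raw_json)
    record = {field: first_present(event, aliases) for field, aliases in FIELD_ALIASES.items()}
    # Enrichment: attach ownership and tier so alerts route to the right team.
    record.update(SERVICE_CATALOG.get(record["service"], {"team": "unknown", "tier": "unknown"}))
    return record


print(normalize('{"app": "orders-api", "responseTimeMs": 850, "http.status_code": 503}'))
# -> {'service': 'orders-api', 'latency_ms': 850, 'status': 503, 'team': 'payments', 'tier': 'critical'}
```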
- Proactive vs Reactive Detection
Purpose: Directly aligned to LPL's concerns
- Give an example where you predicted an incident before customer impact.
- What indicators help you identify impending failures?
- How do you measure the success of proactive detection?
- Multi-Service Failure Diagnosis (Critical Question)
Purpose: Core differentiator at LPL
Scenario Question
A user-facing issue is reported. The architecture includes:
- Frontend
- Backend microservices
- Downstream APIs
- Kafka streams
- Terraform-managed infrastructure
Ask:
- How do you determine if the issue is:
  - Application-related?
  - Kafka or streaming lag?
  - Downstream API latency?
  - Infrastructure drift via Terraform?
Expected Approach: dependency mapping, golden signals, trace correlation, and change analysis (a minimal sketch follows)
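The sketch below is a toy, hypothetical triage heuristic over golden signals and recent change events. It is not how any specific platform works, but it mirrors the expected approach: localize the fault to streaming lag, a downstream dependency, application code, or an infrastructure change.

```python
# Illustrative only: a toy triage heuristic over golden signals and change events.
# Real diagnosis relies on dependency maps and trace correlation; all names and
# thresholds here are hypothetical.

signals = {  # latest golden signals per hop (hypothetical values)
    "frontend":       {"error_rate": 0.08, "p95_latency_ms": 2400},
    "orders-service": {"error_rate": 0.07, "p95_latency_ms": 2300},
    "downstream-api": {"error_rate": 0.01, "p95_latency_ms": 180},
    "kafka":          {"consumer_lag": 1_200_000},   # messages behind
}
recent_changes = [  # deployments / Terraform applies in the last hour
    {"target": "kafka", "type": "terraform-apply", "age_min": 35},
]


def suspects(signals, changes, window_min=60):
    ranked = []
    if signals["kafka"]["consumer_lag"] > 100_000:
        ranked.append(("kafka/streaming lag", "consumer lag far above baseline"))
    if signals["downstream-api"]["p95_latency_ms"] > 1000:
        ranked.append(("downstream API latency", "p95 breaches dependency SLO"))
    if signals["orders-service"]["error_rate"] > 0.05 and signals["downstream-api"]["error_rate"] < 0.02:
        ranked.append(("application code", "errors originate in the service, not its dependencies"))
    for change in changes:
        if change["age_min"] <= window_min:
            ranked.append((f'infrastructure change on {change["target"]}', change["type"]))
    return ranked


for cause, evidence in suspects(signals, recent_changes):
    print(f"suspect: {cause:40s} evidence: {evidence}")
```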
- Dynatrace (Mandatory)
Purpose: Address explicit gap in feedback
- What Dynatrace features have you used most?
- How does Davis AI determine root cause?
- How do you implement service-level baselining in Dynatrace?
- How do you reduce alert noise using Dynatrace?
Red Flag: "I've mostly used dashboards."
- AI/ML & AIOps Fundamentals
Purpose: Ensure non-theoretical knowledge
- What ML techniques are commonly used in AIOps?
- How do supervised vs unsupervised models differ in Ops?
- Where does AI fail in observability?
- How do you validate AI-based decisions?
- GenAI & LLM Use Cases for SRE
Purpose: Explicit LPL requirement
- Where do you see GenAI adding value in SRE?
- Have you used LLMs for incident response?
- How would you integrate GenAI without introducing risk?
- What data would you restrict from LLM exposure?
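For the last question above, one common pattern is redacting sensitive fields before any incident context reaches an LLM. The sketch below is a minimal, hypothetical example; the field names and patterns are placeholders, and a real deployment would also enforce this centrally at the platform or gateway layer.

```python
# Illustrative only: scrubbing sensitive fields from an incident payload before
# any of it is placed in an LLM prompt. Keys and regexes are hypothetical.
import copy
import re

DENYLIST_KEYS = {"account_number", "ssn", "customer_name", "api_key", "auth_token"}
PII_PATTERNS = [re.compile(r"\b\d{3}-\d{2}-\d{4}\b")]  # e.g. US SSN-shaped strings


def redact(payload: dict) -> dict:
    """Return a copy of the incident payload that is safe to include in a prompt."""
    clean = copy.deepcopy(payload)
    for key in list(clean):
        if key in DENYLIST_KEYS:
            clean[key] = "[REDACTED]"
        elif isinstance(clean[key], str):
            for pattern in PII_PATTERNS:
                clean[key] = pattern.sub("[REDACTED]", clean[key])
        elif isinstance(clean[key], dict):
            clean[key] = redact(clean[key])
    return clean


incident = {
    "service": "orders-api",
    "error": "timeout calling downstream-api",
    "customer_name": "Jane Doe",
    "context": {"note": "caller mentioned SSN 123-45-6789 in the ticket", "api_key": "abc123"},
}
print(redact(incident))
# Sensitive values are masked; operational context (service, error) is preserved.
```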