Site Reliability Engineering (SRE) / Observability Engineer
Role Summary
Responsible for defining, implementing, and operationalizing Service Level Indicators (SLIs) and Service Level Objectives (SLOs), building end-to-end observability (metrics, logs, traces), and delivering dashboards and alerting that improve reliability outcomes such as availability, Mean Time to Detect (MTTD), and Mean Time to Restore (MTTR).
Key Responsibilities
SLI / SLO & Error Budget Ownership
- Partner with application and platform teams to derive SLIs aligned to user journeys and business outcomes (e.g., request success, latency, freshness, saturation).
- Define and maintain SLOs (including target, window, and measurement method), document rationale, and ensure ongoing governance.
- Implement and operationalize error budgets (burn rate alerting, SLO reporting, and escalation policies); a minimal calculation sketch follows this list.
- Establish standards for availability measurement (clear denominator/numerator definitions, maintenance exclusions, regional weighting where applicable).
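To make the expected error-budget mechanics concrete, here is a minimal sketch of a request-based availability SLI and its burn rate, assuming a simple good/total request count; the function names, the 99.9% target, and the 0.5% error ratio are illustrative assumptions, not a prescribed design.

```python
# Minimal sketch: request-based availability SLI and error-budget burn rate.
# All names and numbers are illustrative assumptions.

def availability(good_requests: int, total_requests: int) -> float:
    """SLI = good events / valid events over the measurement window."""
    return good_requests / total_requests if total_requests else 1.0

def burn_rate(observed_error_ratio: float, slo_target: float) -> float:
    """How fast the error budget is being consumed relative to plan.

    A burn rate of 1.0 spends the budget exactly over the full SLO window;
    higher values exhaust it proportionally sooner.
    """
    error_budget = 1.0 - slo_target  # allowed error ratio, e.g. 0.001 for a 99.9% target
    return observed_error_ratio / error_budget

# Example: 99.9% SLO with 0.5% of requests failing in the current window
print(round(burn_rate(observed_error_ratio=0.005, slo_target=0.999), 2))  # 5.0 -> burning 5x too fast
```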
Reliability Metrics (MTTR / MTTD / Availability)
- Define consistent calculation methods for MTTD and MTTR (start/stop timestamps, incident severity mapping, and data sources such as incident tools and monitoring events); see the calculation sketch after this list.
- Produce reliability reporting and insights (trend analysis, top contributors, recurring failure patterns).
- Drive incident hygiene improvements (detection coverage, alert quality, runbooks, post-incident actions).
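As one concrete example of a consistent calculation method, the sketch below derives MTTD and MTTR from per-incident timestamps; the field names and the use of impact start as the common reference point are illustrative assumptions, not tied to any specific incident tool.

```python
# Minimal sketch: MTTD/MTTR derived from incident timestamps.
# Field names are illustrative assumptions about the incident data source.
from dataclasses import dataclass
from datetime import datetime
from statistics import mean

@dataclass
class Incident:
    impact_started: datetime  # when user impact began (monitoring events or post-incident review)
    detected: datetime        # first alert/page or ticket creation
    restored: datetime        # service restored to users (not necessarily root-caused)

def mttd_minutes(incidents: list[Incident]) -> float:
    """Mean time from impact start to detection, in minutes."""
    return mean((i.detected - i.impact_started).total_seconds() / 60 for i in incidents)

def mttr_minutes(incidents: list[Incident]) -> float:
    """Mean time from impact start to restoration, in minutes."""
    return mean((i.restored - i.impact_started).total_seconds() / 60 for i in incidents)
```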
Four Golden Signals & Error Rates
- Instrument and monitor the four golden signals:
- Latency (including p95 / p99 percentiles and tail behavior)
- Traffic (RPS, throughput, message rates)
- Errors (error rates by endpoint/service, SLO-based error ratios, dependency errors)
- Saturation (CPU, memory, thread pools, queue depth, connection pools)
- Establish error-rate definitions (HTTP 5xx/4xx policy, timeouts, retries, partial failures) and ensure consistency across services; see the latency and error-rate sketch below.
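The sketch below illustrates tail-latency percentiles and one explicit error-rate policy; counting timeouts and 5xx responses against the SLI while excluding 4xx is an example policy to be agreed per service, not a fixed rule.

```python
# Minimal sketch: p95/p99 latency and an explicit error-rate definition.
# The 5xx/timeout policy is an illustrative assumption.
from statistics import quantiles

def p95_p99(latencies_ms: list[float]) -> tuple[float, float]:
    cuts = quantiles(latencies_ms, n=100)  # cut points p1..p99
    return cuts[94], cuts[98]              # p95 and p99

def is_error(status_code: int, timed_out: bool) -> bool:
    # Example policy: timeouts and server errors count against the SLI; 4xx do not.
    return timed_out or status_code >= 500

def error_rate(responses: list[tuple[int, bool]]) -> float:
    """responses: (status_code, timed_out) pairs for the measurement window."""
    errors = sum(1 for code, timed_out in responses if is_error(code, timed_out))
    return errors / len(responses) if responses else 0.0
```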
Application / Platform Integration (Grafana, Dynatrace, etc.)
- Integrate observability into applications and platforms (cloud, containers, Kubernetes, service mesh, APIs, data pipelines).
- Configure and manage monitoring integrations and instrumentation standards using tools such as:
- Grafana (dashboards, alert rules, data sources, templating, panel standards)
- Dynatrace (APM instrumentation, service flow, distributed traces, anomaly detection, SLO modules where applicable)
- Ensure correlation across telemetry types (link metrics, logs, and traces; consistent tagging/labels such as service, environment, region, version), as illustrated in the tagging sketch below.
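As an example of that tagging consistency, here is a minimal sketch using the OpenTelemetry Python SDK to attach shared service/environment/region/version attributes to emitted traces; the attribute values and the console exporter are illustrative assumptions, not a prescribed setup.

```python
# Minimal sketch: shared resource attributes so traces carry the same
# service/environment/region/version labels used on metrics and logs.
# Values and exporter choice are illustrative assumptions.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

resource = Resource.create({
    "service.name": "checkout-api",
    "deployment.environment": "prod",
    "cloud.region": "us-east-1",
    "service.version": "1.4.2",
})

provider = TracerProvider(resource=resource)
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("charge-card"):
    pass  # spans emitted here carry the shared tags for cross-telemetry correlation
```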
Data Freshness SLIs (Analytics / Pipelines)
- Define and implement data freshness indicators (e.g., time since last successful ingest, event-time lag, end-to-end pipeline latency); a freshness sketch follows this list.
- Create alerting on freshness breaches and integrate with incident response and runbooks.
- Validate freshness SLIs against source-of-truth timestamps and downstream consumption requirements.
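To make the freshness SLI concrete, the sketch below measures event-time lag against a hypothetical 15-minute objective; the threshold and the timestamp source are illustrative assumptions.

```python
# Minimal sketch: data-freshness SLI as event-time lag against an objective.
# The 15-minute objective is an illustrative assumption; timestamps are
# assumed to be timezone-aware UTC source-of-truth values.
from datetime import datetime, timedelta, timezone

FRESHNESS_OBJECTIVE = timedelta(minutes=15)  # e.g. "data no older than 15 minutes"

def freshness_lag(last_successful_event_time: datetime,
                  now: datetime | None = None) -> timedelta:
    """Lag between now and the newest data confirmed ingested."""
    return (now or datetime.now(timezone.utc)) - last_successful_event_time

def freshness_slo_met(last_successful_event_time: datetime) -> bool:
    return freshness_lag(last_successful_event_time) <= FRESHNESS_OBJECTIVE
```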
Splunk Logs: Queries & Dashboarding
- Establish logging standards (structured logging fields, severity levels, correlation IDs, sampling strategy, retention).
- Build and maintain Splunk dashboards driven by log queries in the Search Processing Language (SPL), including:
- Error-rate and top error signatures
- Latency breakdowns derived from logs where needed
- Incident investigation views (trace/correlation ID pivots)
- Develop reusable query patterns and knowledge objects (saved searches, macros, field extractions) and ensure performance/cost awareness; an illustrative query-library sketch follows.
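As a sketch of what a reusable query library might look like, the snippet below keeps a few SPL patterns as named constants; the index, sourcetype, and field names (app_logs, status, level, error_signature, correlation_id) are assumptions about the logging schema, not real sources.

```python
# Minimal sketch: reusable SPL patterns kept as named constants so they can be
# reviewed, shared, and wired into saved searches or dashboard panels.
# Index, sourcetype, and field names are illustrative assumptions.
SPL_QUERIES = {
    # 5xx error rate in 5-minute buckets
    "error_rate_5m": (
        "index=app_logs sourcetype=app:json "
        "| bin _time span=5m "
        "| stats count(eval(status>=500)) AS errors, count AS total BY _time "
        "| eval error_rate=round(errors/total*100, 2)"
    ),
    # Top recurring error signatures for triage
    "top_error_signatures": (
        "index=app_logs level=ERROR "
        "| stats count BY error_signature "
        "| sort -count | head 20"
    ),
    # Incident investigation: pivot on a correlation ID across services
    "correlation_pivot": (
        'index=app_logs correlation_id="$cid$" '
        "| sort _time "
        "| table _time, service, level, message"
    ),
}
```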
Dashboards, Alerting, and Operational Readiness
- Create executive and engineering dashboards showing SLO compliance, error budget burn, golden signals, and key dependencies.
- Design actionable alerting (noise reduction, deduplication, symptom vs cause separation, severity routing); see the burn-rate paging sketch after this list.
- Produce runbooks and operational docs (alert response, triage steps, known failure modes, rollback guidance).
- Support incident response and continuous improvement (post-incident reviews, corrective actions, reliability backlog).
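One common way to keep SLO-based alerts actionable rather than noisy is multi-window burn-rate paging; the sketch below reuses the burn-rate formula from the earlier sketch and applies the widely cited 1-hour/6-hour windows with 14.4x/6x thresholds as illustrative values to be tuned per service.

```python
# Minimal sketch: multi-window burn-rate paging condition.
# Window sizes and thresholds (1h at 14.4x AND 6h at 6x) are illustrative
# values from common practice; tune to the SLO window and risk tolerance.

def burn_rate(error_ratio: float, slo_target: float) -> float:
    return error_ratio / (1.0 - slo_target)

def should_page(error_ratio_1h: float, error_ratio_6h: float,
                slo_target: float = 0.999) -> bool:
    # Requiring both the fast and the slow window to breach filters out short
    # spikes (noise reduction) while still paging quickly on sustained burn.
    return (burn_rate(error_ratio_1h, slo_target) > 14.4
            and burn_rate(error_ratio_6h, slo_target) > 6.0)
```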
Required Qualifications / Skills
- Strong experience defining and operationalizing SLIs, SLOs, error budgets, and reliability metrics (availability, MTTD, MTTR).
- Hands-on capability building dashboards and alerts in Grafana and/or Dynatrace.
- Strong logging and analysis experience with Splunk, including building dashboards from log-based queries (SPL).
- Solid understanding of latency measurement, including percentiles (p95, p99) and implications of tail latency.
- Demonstrated ability to integrate observability across distributed systems (microservices, APIs, containers, cloud services, data pipelines).
- Practical incident management experience and ability to translate telemetry into improved detection and faster restoration.
Preferred Qualifications
- Experience with distributed tracing standards (e.g., OpenTelemetry) and service dependency mapping.
- Experience designing monitoring for data platforms (streaming/batch pipelines, freshness and completeness SLIs).
- Familiarity with SLO reporting automation and burn-rate based alerting patterns.
Core Deliverables
- SLI/SLO catalog with documented definitions and data sources
- SLO dashboards and error budget burn views (Grafana/Dynatrace)
- Golden-signal dashboards per service and dependency layer
- Splunk dashboards and reusable SPL query library
- Alerting standards, tuned alert rules, runbooks, and reliability reporting for availability/MTTD/MTTR
Regards,
Radiantze Inc