Site Reliability Engineering (SRE) / Observability Engineer

• Posted 12 days ago • Updated 6 hours ago
Full Time
USD $55-57/hr

Job Details

Skills

  • Amazon Web Services
  • Reliability Engineering
  • Service Level
  • Recovery
  • Trend Analysis
  • CPU
  • Thread
  • HTTP
  • Kubernetes
  • Management
  • Software Performance Management
  • Instrumentation
  • Analytics
  • Macros
  • Regulatory Compliance
  • Data Deduplication
  • Routing
  • Continuous Improvement
  • Microservices
  • Cloud Computing
  • Incident Management
  • Mapping
  • Streaming
  • Budget
  • Grafana
  • Dynatrace
  • Splunk
  • Dashboard
  • SPL
  • Reporting

Summary


Job Title: Observability / Monitoring Consultant (AWS Data Platforms)

Location: Jersey City, NJ; Dallas, TX; or New York, NY

Duration: 12 Months Contract

Formal Job Description (JD): Site Reliability Engineering (SRE) / Observability Engineer

Role Summary

Responsible for defining, implementing, and operationalizing Service Level Indicators (SLIs) and Service Level Objectives (SLOs), building end-to-end observability (metrics, logs, traces), and delivering dashboards and alerting that improve reliability outcomes such as availability, Mean Time to Detect (MTTD), and Mean Time to Restore (MTTR).

Key Responsibilities

SLI / SLO & Error Budget Ownership

  • Partner with application and platform teams to derive SLIs aligned to user journeys and business outcomes (e.g., request success, latency, freshness, saturation).
  • Define and maintain SLOs (including target, window, and measurement method), document rationale, and ensure ongoing governance.
  • Implement and operationalize error budgets (burn rate alerting, SLO reporting, and escalation policies).
  • Establish standards for availability measurement (clear denominator/numerator definitions, maintenance exclusions, regional weighting where applicable).
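As an illustration of the error-budget arithmetic behind these responsibilities, a minimal sketch (targets and counts are illustrative; real SLIs would come from monitoring data, not constants):

```python
# Sketch: error-budget math for an availability SLO. Numbers are
# illustrative, not prescriptive.

def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Total allowed downtime (minutes) in the SLO window."""
    return (1.0 - slo_target) * window_days * 24 * 60

def burn_rate(errors: int, total: int, slo_target: float) -> float:
    """Observed error ratio divided by the budgeted error ratio.
    A burn rate of 1.0 consumes the budget exactly over the window."""
    observed = errors / total
    budgeted = 1.0 - slo_target
    return observed / budgeted

budget = error_budget_minutes(0.999, 30)  # about 43.2 minutes per 30 days
rate = burn_rate(errors=50, total=10_000, slo_target=0.999)  # burn rate 5.0
```

A 99.9% target over 30 days leaves roughly 43 minutes of budget; a sustained burn rate of 5 would exhaust it in about six days, which is the kind of signal burn-rate alerting is built on.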

Reliability Metrics (MTTR / MTTD / Availability)

  • Define consistent calculation methods for MTTD and MTTR (start/stop timestamps, incident severity mapping, and data sources such as incident tools and monitoring events).
  • Produce reliability reporting and insights (trend analysis, top contributors, recurring failure patterns).
  • Drive incident hygiene improvements (detection coverage, alert quality, runbooks, post-incident actions).
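The consistent-calculation point above can be sketched as follows; the field names (`started`, `detected`, `restored`) are hypothetical and would in practice be mapped from the incident tool's schema:

```python
# Sketch: MTTD/MTTR from incident records with explicit start/stop
# timestamps. Field names are illustrative, not a real tool's schema.
from datetime import datetime
from statistics import mean

incidents = [
    {"started": datetime(2024, 1, 1, 10, 0),
     "detected": datetime(2024, 1, 1, 10, 4),
     "restored": datetime(2024, 1, 1, 10, 34)},
    {"started": datetime(2024, 1, 2, 9, 0),
     "detected": datetime(2024, 1, 2, 9, 10),
     "restored": datetime(2024, 1, 2, 9, 40)},
]

def mttd_minutes(incidents):
    """Mean time from incident start to detection."""
    return mean((i["detected"] - i["started"]).total_seconds() / 60
                for i in incidents)

def mttr_minutes(incidents):
    """Mean time from incident start to restoration."""
    return mean((i["restored"] - i["started"]).total_seconds() / 60
                for i in incidents)

# For these two incidents: MTTD = (4 + 10) / 2 = 7.0 min,
# MTTR = (34 + 40) / 2 = 37.0 min
```

Pinning the definitions down in code (or a shared query) is what makes trend analysis across months and severities comparable.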

4 Golden Signals & Error Rates

  • Instrument and monitor the 4 Golden Signals:
    • Latency (including p95 / p99 percentiles and tail behavior)
    • Traffic (RPS, throughput, message rates)
    • Errors (error rates by endpoint/service, SLO-based error ratios, dependency errors)
    • Saturation (CPU, memory, thread pools, queue depth, connection pools)
  • Establish error-rate definitions (HTTP 5xx/4xx policy, timeouts, retries, partial failures) and ensure consistency across services.
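A minimal sketch of the latency-percentile and error-rate definitions above (the error policy shown — 5xx or timeout counts as an error — is one illustrative choice, not the only valid one):

```python
# Sketch: nearest-rank percentiles for tail latency, plus one possible
# error-rate definition. Data and policy are illustrative.

def percentile(values, p):
    """Nearest-rank percentile (p in 0..100)."""
    ordered = sorted(values)
    k = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[k]

latencies_ms = [12, 15, 14, 200, 18, 16, 13, 950, 17, 14]
p99 = percentile(latencies_ms, 99)  # dominated by the 950 ms outlier

def error_rate(responses):
    """responses: list of (status_code, timed_out) tuples.
    Policy here: any 5xx or timeout is an error; 4xx is not."""
    errors = sum(1 for code, timed_out in responses
                 if timed_out or code >= 500)
    return errors / len(responses)
```

Note how a single 950 ms request dominates p99 while leaving the median untouched — the "tail behavior" the bullet above refers to.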

Application / Platform Integration (Grafana, Dynatrace, etc.)

  • Integrate observability into applications and platforms (cloud, containers, Kubernetes, service mesh, APIs, data pipelines).
  • Configure and manage monitoring integrations and instrumentation standards using tools such as:
    • Grafana (dashboards, alert rules, data sources, templating, panel standards)
    • Dynatrace (APM instrumentation, service flow, distributed traces, anomaly detection, SLO modules where applicable)
  • Ensure correlation across telemetry types (link metrics, logs, and traces; consistent tagging/labels such as service, environment, region, version).
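As an illustration of the consistent-tagging convention in the last bullet, a minimal sketch (service names and values are hypothetical):

```python
# Sketch: one shared label set applied to every telemetry payload so
# metrics, logs, and traces can be joined on the same keys. Values
# are hypothetical.
COMMON_LABELS = {
    "service": "payments-api",
    "environment": "prod",
    "region": "us-east-1",
    "version": "2.3.1",
}

def tag(payload: dict, labels: dict = COMMON_LABELS) -> dict:
    """Attach the shared labels to a telemetry payload."""
    return {**payload, **labels}

metric_point = tag({"metric": "latency_ms", "value": 12})
log_event = tag({"level": "ERROR", "msg": "db timeout"})
```

Keeping the key names identical across tools is what makes the metric-to-log-to-trace pivot possible during an incident.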

Data Freshness SLIs (Analytics / Pipelines)

  • Define and implement data freshness indicators (e.g., time since last successful ingest, event-time lag, end-to-end pipeline latency).
  • Create alerting on freshness breaches and integrate with incident response and runbooks.
  • Validate freshness SLIs against source-of-truth timestamps and downstream consumption requirements.

Splunk Logs: Queries & Dashboarding

  • Establish logging standards (structured logging fields, severity levels, correlation IDs, sampling strategy, retention).
  • Build and maintain Splunk dashboards using logs and queries (e.g., Splunk's Search Processing Language (SPL)), including:
    • Error-rate and top error signatures
    • Latency breakdowns derived from logs where needed
    • Incident investigation views (trace/correlation ID pivots)
  • Develop reusable query patterns and knowledge objects (saved searches, macros, field extractions) and ensure performance/cost awareness.
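The analysis patterns above (top error signatures, correlation-ID pivots) are sketched here over structured JSON logs in Python purely for illustration; in Splunk the equivalents would be SPL searches over indexed events:

```python
# Sketch: log-analysis patterns over structured JSON logs. Log lines
# and field names are hypothetical examples of a structured-logging
# standard with correlation IDs.
import json
from collections import Counter

raw_logs = [
    '{"level": "ERROR", "msg": "db timeout", "corr_id": "a1"}',
    '{"level": "ERROR", "msg": "db timeout", "corr_id": "a2"}',
    '{"level": "INFO", "msg": "ok", "corr_id": "a1"}',
    '{"level": "ERROR", "msg": "bad input", "corr_id": "a3"}',
]
events = [json.loads(line) for line in raw_logs]

# Top error signatures (roughly analogous to an SPL
# "level=ERROR | stats count by msg" search).
top_errors = Counter(e["msg"] for e in events if e["level"] == "ERROR")

def by_corr_id(events, corr_id):
    """Correlation-ID pivot: every event belonging to one request."""
    return [e for e in events if e["corr_id"] == corr_id]
```

Packaging these as saved searches, macros, and field extractions is what turns one-off investigations into the reusable knowledge objects the bullet above describes.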

Dashboards, Alerting, and Operational Readiness

  • Create executive and engineering dashboards showing SLO compliance, error budget burn, golden signals, and key dependencies.
  • Design actionable alerting (noise reduction, deduplication, symptom vs cause separation, severity routing).
  • Produce runbooks and operational docs (alert response, triage steps, known failure modes, rollback guidance).
  • Support incident response and continuous improvement (post-incident reviews, corrective actions, reliability backlog).
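The noise-reduction idea behind actionable alerting can be sketched with a multi-window burn-rate check; the 14.4 threshold is a commonly cited fast-burn pairing (consuming roughly 2% of a 30-day budget in one hour) and is illustrative, not a mandated value:

```python
# Sketch: multi-window burn-rate alert condition. Threshold is an
# illustrative fast-burn value, not a fixed requirement.

def should_page(burn_1h: float, burn_5m: float,
                threshold: float = 14.4) -> bool:
    """Page only when both windows exceed the threshold: the long
    window confirms sustained budget burn, the short window confirms
    the burn is still happening (avoids paging on a resolved blip)."""
    return burn_1h > threshold and burn_5m > threshold
```

Requiring both windows is one way to separate symptoms worth paging on from transient noise, which is the trade-off the alert-design bullet above calls out.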

Required Qualifications / Skills

  • Strong experience defining and operationalizing SLIs, SLOs, error budgets, and reliability metrics (availability, MTTD, MTTR).
  • Hands-on capability building dashboards and alerts in Grafana and/or Dynatrace.
  • Strong logging and analysis experience with Splunk, including building dashboards from log-based queries (SPL).
  • Solid understanding of latency measurement, including percentiles (p95, p99) and implications of tail latency.
  • Demonstrated ability to integrate observability across distributed systems (microservices, APIs, containers, cloud services, data pipelines).
  • Practical incident management experience and ability to translate telemetry into improved detection and faster restoration.

Preferred Qualifications

  • Experience with distributed tracing standards (e.g., OpenTelemetry) and service dependency mapping.
  • Experience designing monitoring for data platforms (streaming/batch pipelines, freshness and completeness SLIs).
  • Familiarity with SLO reporting automation and burn-rate based alerting patterns.

Core Deliverables

  • SLI/SLO catalog with documented definitions and data sources
  • SLO dashboards and error budget burn views (Grafana/Dynatrace)
  • Golden-signal dashboards per service and dependency layer
  • Splunk dashboards and reusable SPL query library
  • Alerting standards, tuned alert rules, runbooks, and reliability reporting for availability/MTTD/MTTR

Employers have access to artificial intelligence language tools (“AI”) that help generate and enhance job descriptions and AI may have been used to create this description. The position description has been reviewed for accuracy and Dice believes it to correctly reflect the job opportunity.
  • Dice Id: 91165685
  • Position Id: OOJ - 1138-139-1771532007
