Site Reliability Engineering (SRE) / Observability Engineer

• Posted 12 days ago • Updated 6 hours ago
Full Time
USD $55-57/hr

Job Details

Skills

  • Amazon Web Services
  • Reliability Engineering
  • Service Level
  • Recovery
  • Trend Analysis
  • CPU
  • Thread
  • HTTP
  • Kubernetes
  • Management
  • Software Performance Management
  • Instrumentation
  • Analytics
  • Macros
  • Regulatory Compliance
  • Data Deduplication
  • Routing
  • Continuous Improvement
  • Microservices
  • Cloud Computing
  • Incident Management
  • Mapping
  • Streaming
  • Budget
  • Grafana
  • Dynatrace
  • Splunk
  • Dashboard
  • SPL
  • Reporting

Summary


Job Title: Observability / Monitoring Consultant (AWS Data Platforms)

Location: Jersey City, NJ; Dallas, TX; or New York, NY

Duration: 12 Months Contract

Formal Job Description (JD): Site Reliability Engineering (SRE) / Observability Engineer

Role Summary

Responsible for defining, implementing, and operationalizing Service Level Indicators (SLIs) and Service Level Objectives (SLOs), building end-to-end observability (metrics, logs, traces), and delivering dashboards and alerting that improve reliability outcomes such as availability, Mean Time to Detect (MTTD), and Mean Time to Restore (MTTR).

Key Responsibilities

SLI / SLO & Error Budget Ownership

  • Partner with application and platform teams to derive SLIs aligned to user journeys and business outcomes (e.g., request success, latency, freshness, saturation).
  • Define and maintain SLOs (including target, window, and measurement method), document rationale, and ensure ongoing governance.
  • Implement and operationalize error budgets (burn rate alerting, SLO reporting, and escalation policies).
  • Establish standards for availability measurement (clear denominator/numerator definitions, maintenance exclusions, regional weighting where applicable).
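As an illustration of the error-budget arithmetic behind these responsibilities, a minimal sketch (targets and counts are illustrative; real SLIs would come from monitoring data, not constants):

```python
# Sketch: error-budget math for an availability SLO. Numbers are
# illustrative, not prescriptive.

def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Total allowed downtime (minutes) in the SLO window."""
    return (1.0 - slo_target) * window_days * 24 * 60

def burn_rate(errors: int, total: int, slo_target: float) -> float:
    """Observed error ratio divided by the budgeted error ratio.
    A burn rate of 1.0 consumes the budget exactly over the window."""
    observed = errors / total
    budgeted = 1.0 - slo_target
    return observed / budgeted

budget = error_budget_minutes(0.999, 30)  # about 43.2 minutes per 30 days
rate = burn_rate(errors=50, total=10_000, slo_target=0.999)  # burn rate 5.0
```

A 99.9% target over 30 days leaves roughly 43 minutes of budget; a sustained burn rate of 5 would exhaust it in about six days, which is the kind of signal burn-rate alerting is built on.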

Reliability Metrics (MTTR / MTTD / Availability)

  • Define consistent calculation methods for MTTD and MTTR (start/stop timestamps, incident severity mapping, and data sources such as incident tools and monitoring events).
  • Produce reliability reporting and insights (trend analysis, top contributors, recurring failure patterns).
  • Drive incident hygiene improvements (detection coverage, alert quality, runbooks, post-incident actions).
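The consistent-calculation point above can be sketched as follows; the field names (`started`, `detected`, `restored`) are hypothetical and would in practice be mapped from the incident tool's schema:

```python
# Sketch: MTTD/MTTR from incident records with explicit start/stop
# timestamps. Field names are illustrative, not a real tool's schema.
from datetime import datetime
from statistics import mean

incidents = [
    {"started": datetime(2024, 1, 1, 10, 0),
     "detected": datetime(2024, 1, 1, 10, 4),
     "restored": datetime(2024, 1, 1, 10, 34)},
    {"started": datetime(2024, 1, 2, 9, 0),
     "detected": datetime(2024, 1, 2, 9, 10),
     "restored": datetime(2024, 1, 2, 9, 40)},
]

def mttd_minutes(incidents):
    """Mean time from incident start to detection."""
    return mean((i["detected"] - i["started"]).total_seconds() / 60
                for i in incidents)

def mttr_minutes(incidents):
    """Mean time from incident start to restoration."""
    return mean((i["restored"] - i["started"]).total_seconds() / 60
                for i in incidents)

# For these two incidents: MTTD = (4 + 10) / 2 = 7.0 min,
# MTTR = (34 + 40) / 2 = 37.0 min
```

Pinning the definitions down in code (or a shared query) is what makes trend analysis across months and severities comparable.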

4 Golden Signals & Error Rates

  • Instrument and monitor the 4 Golden Signals:
    • Latency (including p95 / p99 percentiles and tail behavior)
    • Traffic (RPS, throughput, message rates)
    • Errors (error rates by endpoint/service, SLO-based error ratios, dependency errors)
    • Saturation (CPU, memory, thread pools, queue depth, connection pools)
  • Establish error-rate definitions (HTTP 5xx/4xx policy, timeouts, retries, partial failures) and ensure consistency across services.
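A minimal sketch of the latency-percentile and error-rate definitions above (the error policy shown — 5xx or timeout counts as an error — is one illustrative choice, not the only valid one):

```python
# Sketch: nearest-rank percentiles for tail latency, plus one possible
# error-rate definition. Data and policy are illustrative.

def percentile(values, p):
    """Nearest-rank percentile (p in 0..100)."""
    ordered = sorted(values)
    k = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[k]

latencies_ms = [12, 15, 14, 200, 18, 16, 13, 950, 17, 14]
p99 = percentile(latencies_ms, 99)  # dominated by the 950 ms outlier

def error_rate(responses):
    """responses: list of (status_code, timed_out) tuples.
    Policy here: any 5xx or timeout is an error; 4xx is not."""
    errors = sum(1 for code, timed_out in responses
                 if timed_out or code >= 500)
    return errors / len(responses)
```

Note how a single 950 ms request dominates p99 while leaving the median untouched — the "tail behavior" the bullet above refers to.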

Application / Platform Integration (Grafana, Dynatrace, etc.)

  • Integrate observability into applications and platforms (cloud, containers, Kubernetes, service mesh, APIs, data pipelines).
  • Configure and manage monitoring integrations and instrumentation standards using tools such as:
    • Grafana (dashboards, alert rules, data sources, templating, panel standards)
    • Dynatrace (APM instrumentation, service flow, distributed traces, anomaly detection, SLO modules where applicable)
  • Ensure correlation across telemetry types (link metrics, logs, and traces; consistent tagging/labels such as service, environment, region, version).
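As an illustration of the consistent-tagging convention in the last bullet, a minimal sketch (service names and values are hypothetical):

```python
# Sketch: one shared label set applied to every telemetry payload so
# metrics, logs, and traces can be joined on the same keys. Values
# are hypothetical.
COMMON_LABELS = {
    "service": "payments-api",
    "environment": "prod",
    "region": "us-east-1",
    "version": "2.3.1",
}

def tag(payload: dict, labels: dict = COMMON_LABELS) -> dict:
    """Attach the shared labels to a telemetry payload."""
    return {**payload, **labels}

metric_point = tag({"metric": "latency_ms", "value": 12})
log_event = tag({"level": "ERROR", "msg": "db timeout"})
```

Keeping the key names identical across tools is what makes the metric-to-log-to-trace pivot possible during an incident.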

Data Freshness SLIs (Analytics / Pipelines)

  • Define and implement data freshness indicators (e.g., time since last successful ingest, event-time lag, end-to-end pipeline latency).
  • Create alerting on freshness breaches and integrate with incident response and runbooks.
  • Validate freshness SLIs against source-of-truth timestamps and downstream consumption requirements.

Splunk Logs: Queries & Dashboarding

  • Establish logging standards (structured logging fields, severity levels, correlation IDs, sampling strategy, retention).
  • Build and maintain Splunk dashboards using logs and queries (e.g., Splunk's Search Processing Language (SPL)), including:
    • Error-rate and top error signatures
    • Latency breakdowns derived from logs where needed
    • Incident investigation views (trace/correlation ID pivots)
  • Develop reusable query patterns and knowledge objects (saved searches, macros, field extractions) and ensure performance/cost awareness.
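The analysis patterns above (top error signatures, correlation-ID pivots) are sketched here over structured JSON logs in Python purely for illustration; in Splunk the equivalents would be SPL searches over indexed events:

```python
# Sketch: log-analysis patterns over structured JSON logs. Log lines
# and field names are hypothetical examples of a structured-logging
# standard with correlation IDs.
import json
from collections import Counter

raw_logs = [
    '{"level": "ERROR", "msg": "db timeout", "corr_id": "a1"}',
    '{"level": "ERROR", "msg": "db timeout", "corr_id": "a2"}',
    '{"level": "INFO", "msg": "ok", "corr_id": "a1"}',
    '{"level": "ERROR", "msg": "bad input", "corr_id": "a3"}',
]
events = [json.loads(line) for line in raw_logs]

# Top error signatures (roughly analogous to an SPL
# "level=ERROR | stats count by msg" search).
top_errors = Counter(e["msg"] for e in events if e["level"] == "ERROR")

def by_corr_id(events, corr_id):
    """Correlation-ID pivot: every event belonging to one request."""
    return [e for e in events if e["corr_id"] == corr_id]
```

Packaging these as saved searches, macros, and field extractions is what turns one-off investigations into the reusable knowledge objects the bullet above describes.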

Dashboards, Alerting, and Operational Readiness

  • Create executive and engineering dashboards showing SLO compliance, error budget burn, golden signals, and key dependencies.
  • Design actionable alerting (noise reduction, deduplication, symptom vs cause separation, severity routing).
  • Produce runbooks and operational docs (alert response, triage steps, known failure modes, rollback guidance).
  • Support incident response and continuous improvement (post-incident reviews, corrective actions, reliability backlog).
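The noise-reduction idea behind actionable alerting can be sketched with a multi-window burn-rate check; the 14.4 threshold is a commonly cited fast-burn pairing (consuming roughly 2% of a 30-day budget in one hour) and is illustrative, not a mandated value:

```python
# Sketch: multi-window burn-rate alert condition. Threshold is an
# illustrative fast-burn value, not a fixed requirement.

def should_page(burn_1h: float, burn_5m: float,
                threshold: float = 14.4) -> bool:
    """Page only when both windows exceed the threshold: the long
    window confirms sustained budget burn, the short window confirms
    the burn is still happening (avoids paging on a resolved blip)."""
    return burn_1h > threshold and burn_5m > threshold
```

Requiring both windows is one way to separate symptoms worth paging on from transient noise, which is the trade-off the alert-design bullet above calls out.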

Required Qualifications / Skills

  • Strong experience defining and operationalizing SLIs, SLOs, error budgets, and reliability metrics (availability, MTTD, MTTR).
  • Hands-on capability building dashboards and alerts in Grafana and/or Dynatrace.
  • Strong logging and analysis experience with Splunk, including building dashboards from log-based queries (SPL).
  • Solid understanding of latency measurement, including percentiles (p95, p99) and implications of tail latency.
  • Demonstrated ability to integrate observability across distributed systems (microservices, APIs, containers, cloud services, data pipelines).
  • Practical incident management experience and ability to translate telemetry into improved detection and faster restoration.

Preferred Qualifications

  • Experience with distributed tracing standards (e.g., OpenTelemetry) and service dependency mapping.
  • Experience designing monitoring for data platforms (streaming/batch pipelines, freshness and completeness SLIs).
  • Familiarity with SLO reporting automation and burn-rate based alerting patterns.

Core Deliverables

  • SLI/SLO catalog with documented definitions and data sources
  • SLO dashboards and error budget burn views (Grafana/Dynatrace)
  • Golden-signal dashboards per service and dependency layer
  • Splunk dashboards and reusable SPL query library
  • Alerting standards, tuned alert rules, runbooks, and reliability reporting for availability/MTTD/MTTR

Employers have access to artificial intelligence language tools (“AI”) that help generate and enhance job descriptions and AI may have been used to create this description. The position description has been reviewed for accuracy and Dice believes it to correctly reflect the job opportunity.
  • Dice Id: 91165685
  • Position Id: OOJ - 1138-139-1771532007
