Site Reliability Engineer

Overview

On Site
$70 - $80
Contract - Independent
Contract - W2
Contract - 12 Month(s)

Skills

SRE for AI systems
OpenTelemetry
Prometheus
APM Tools
Grafana
Splunk
AI observability

Job Details

Site Reliability Engineer
Responsibilities
Design observability stacks tailored for AI agent performance (latency, cost, quality).
Implement anomaly detection for runtime errors, hallucinations, and agent drift.

Collaborate with the Ops/SRE Agent to automate remediation workflows.
Define reliability SLIs/SLOs for agent-driven systems.
Architect and operationalize end-to-end observability frameworks (metrics, traces, logs, golden signals) across clusters, workloads, and services.

Shape the orchestration platform roadmap for resiliency, scalability, and operational intelligence in alignment with business objectives.

Requirements
Background in SRE for AI systems or large distributed platforms.
Strong proficiency with OpenTelemetry, Prometheus, APM tools, Grafana, and Splunk.
Familiarity with AI observability (LLM trace monitoring, token cost tracking, drift detection).
Ability to integrate AI reliability checks into CI/CD and production environments.
Deep expertise in orchestration platforms (Kubernetes, Nomad, Mesos, or equivalent) at enterprise scale.
Preferred
Experience in AIOps or ML observability.
Background in incident management (PagerDuty, OpsGenie, ServiceNow).
Proven success architecting and delivering AIOps and NoOps solutions, including event correlation, AI-driven automation, and self-healing operations.
Experience with automation and programming in Python, Go, or a similar language, including building ML- or AI-integrated pipelines.

Employers have access to artificial intelligence language tools (“AI”) that help generate and enhance job descriptions and AI may have been used to create this description. The position description has been reviewed for accuracy and Dice believes it to correctly reflect the job opportunity.