Site Reliability Engineer
Responsibilities
Design observability stacks tailored for AI agent performance (latency, cost, quality).
Implement anomaly detection for runtime errors, hallucinations, and agent drift.
Collaborate with Ops/SRE Agent to automate remediation workflows.
Define reliability SLIs/SLOs for agent-driven systems.
Architect and operationalize end-to-end observability frameworks (metrics, traces, logs, golden signals) across clusters, workloads, and services.
Shape the orchestration platform roadmap for resiliency, scalability, and operational intelligence in alignment with business objectives.
Requirements
Background in SRE for AI systems or large distributed platforms.
Strong proficiency with OpenTelemetry, Prometheus, Grafana, Splunk, and APM tools.
Familiarity with AI observability (LLM trace monitoring, token cost tracking, drift detection).
Ability to integrate AI reliability checks into CI/CD and production environments.
Deep expertise in orchestration platforms (Kubernetes, Nomad, Mesos, or equivalent) at enterprise scale.
Preferred
Experience in AIOps or ML observability.
Background in incident management (PagerDuty, OpsGenie, ServiceNow).
Proven success architecting and delivering AIOps and NoOps solutions, including event correlation, AI-driven automation, and self-healing operations.
Experience automating and programming in Python, Go, or a similar language, including building ML- or AI-integrated pipelines.