Sr Site Reliability Engineer

Overview

Hybrid

$70 - $75

Contract - W2

Skills

linux

kubernetes

sase

grafana

Terraform

ceph

Job Details

Senior Site Reliability Engineer (Contract to Hire)

Location: McKinney, TX (Hybrid, 2-3 days onsite). Must be authorized to work in the U.S.

Overview: Our client is seeking a Senior Site Reliability Engineer to lead platform reliability and traffic enforcement in a Kubernetes-hosted SASE (Secure Access Service Edge) environment. This role ensures high availability, observability, and fair multi-tenant traffic handling across distributed systems.

Key Responsibilities:

Platform Reliability & Operations

Own uptime (target: 99.99%) and stability of multi-region Kubernetes environments.
Architect resilient, scalable infrastructure with proactive capacity planning and automated remediation.
Lead incident response, root cause analysis, disaster recovery, and change management.

Observability & Monitoring

Build a full-stack observability pipeline (Prometheus, OpenTelemetry, Grafana, etc.).
Implement golden signals, tracing, and alerting to drive real-time performance insights.
Develop automation for issue detection and resolution.

Kubernetes & Infrastructure

Manage full Kubernetes lifecycle (upgrades, autoscaling, GitOps automation).
Integrate and optimize OpenStack-based infrastructure beneath Kubernetes.
Enforce security compliance, resource efficiency, and FinOps best practices.

Traffic Enforcement & Networking

Design a Kubernetes-native traffic control layer for per-tenant/session enforcement.
Implement CRDs, custom controllers, and service mesh (e.g., Istio, Linkerd) for dynamic policy management.
Operate SDN telemetry agents (Cilium Hubble, WireGuard) and integrate with observability stack.

Leadership & Strategy

Contribute to infrastructure architecture and reliability strategy.
Mentor team members and promote Kubernetes best practices.
Partner cross-functionally across engineering, security, and product teams.

Required Skills:

Kubernetes in production across multi-region architectures.
Observability tools: Prometheus, OpenTelemetry, Grafana, Jaeger, Loki.
Strong Linux networking (tc, nftables, WireGuard, iptables).
Infrastructure automation: Helm, Terraform, ArgoCD/Flux (GitOps).
Programming: Go (preferred), Python/Bash scripting.
Familiarity with OpenStack (Nova, Neutron, Ceph) and CNI (Cilium preferred).

Preferred Experience:

Service mesh deployment (Istio, Linkerd), multi-cluster tools (Fleet, Rancher).
Chaos engineering frameworks (Chaos Mesh, Litmus).
Developer platform abstraction on Kubernetes.
FinOps cost optimization practices.
Edge Kubernetes and NFV/SDN background.
Active participation in the Kubernetes community.
