Overview
Hybrid
$70 - $75
Contract - W2
Skills
linux
kubernetes
sase
grafana
Terraform
ceph
Job Details
Senior Site Reliability Engineer (Contract to Hire)
Location: McKinney, TX (Hybrid, 2-3 days onsite). Must be authorized to work in the U.S.
Overview: Our client is seeking a Senior Site Reliability Engineer to lead platform reliability and traffic enforcement in a Kubernetes-hosted SASE (Secure Access Service Edge) environment. This role ensures high availability, observability, and fair multi-tenant traffic handling across distributed systems.
Key Responsibilities:
Platform Reliability & Operations
- Own uptime (target: 99.99%) and stability of multi-region Kubernetes environments.
- Architect resilient, scalable infrastructure with proactive capacity planning and automated remediation.
- Lead incident response, root cause analysis, disaster recovery, and change management.
Observability & Monitoring
- Build a full-stack observability pipeline (Prometheus, OpenTelemetry, Grafana, etc.).
- Implement golden signals, tracing, and alerting to drive real-time performance insights.
- Develop automation for issue detection and resolution.
Kubernetes & Infrastructure
- Manage full Kubernetes lifecycle (upgrades, autoscaling, GitOps automation).
- Integrate and optimize OpenStack-based infrastructure beneath Kubernetes.
- Enforce security compliance, resource efficiency, and FinOps best practices.
Traffic Enforcement & Networking
- Design a Kubernetes-native traffic control layer for per-tenant/session enforcement.
- Implement CRDs, custom controllers, and service mesh (e.g., Istio, Linkerd) for dynamic policy management.
- Operate SDN telemetry agents (Cilium Hubble, WireGuard) and integrate them with the observability stack.
Leadership & Strategy
- Contribute to infrastructure architecture and reliability strategy.
- Mentor team members and promote Kubernetes best practices.
- Partner cross-functionally across engineering, security, and product teams.
Required Skills:
- Kubernetes in production across multi-region architectures.
- Observability tools: Prometheus, OpenTelemetry, Grafana, Jaeger, Loki.
- Strong Linux networking (tc, nftables, WireGuard, iptables).
- Infrastructure automation: Helm, Terraform, ArgoCD/Flux (GitOps).
- Programming: Go (preferred), Python/Bash scripting.
- Familiarity with OpenStack (Nova, Neutron, Ceph) and CNI (Cilium preferred).
Preferred Experience:
- Service mesh deployment (Istio, Linkerd), multi-cluster tools (Fleet, Rancher).
- Chaos engineering frameworks (Chaos Mesh, Litmus).
- Developer platform abstraction on Kubernetes.
- FinOps cost optimization practices.
- Edge Kubernetes and NFV/SDN background.
- Active participation in the Kubernetes community.