Job Details
Client: Arista, California City, CA (Hybrid 3 days/week as per client)
This is not a traditional operations role. You'll own critical, hands-on operational tasks while leading efforts to eliminate manual toil through automation and systems engineering.
You'll work closely with engineering, tooling, and platform teams to ensure operational excellence and system reliability across customer deployments.
Key Responsibilities
Phase 1: Stabilize and Map (0 - 6 Months)
- Own operational workload: deployments, upgrades, incident response.
- Ensure stability while identifying manual pain points.
Phase 2: Automate and Influence (6 - 18 Months)
- Automate repetitive operational tasks using scripting and IaC.
- Develop internal tooling; collaborate with platform teams to reduce manual effort.
Phase 3: Architect and Evangelize (2+ Years)
- Define SLOs, improve observability and influence product design for reliability.
- Promote SRE principles across engineering.
Required Skills
- DevOps/SRE Experience: Strong background in Site Reliability or DevOps engineering.
- Linux & Networking: Strong command of Linux systems and networking fundamentals (TCP/IP, DNS, routing).
- Cloud Infrastructure: Hands-on experience with AWS (VPC, EC2, IAM, S3) and Terraform.
- Monitoring & Observability: Build and manage telemetry pipelines (metrics, logs, traces).
- Automation & Coding: Proficient in Python or Go, with strong Bash scripting skills.
- Incident Management: Skilled at stabilizing crises and designing long-term prevention systems.
Preferred Skills
- Experience with Kafka, Postgres, nginx, systemd.
- Familiarity with Nix/NixOS (training provided if new).
- Exposure to functional programming (Scala, Haskell, Rust, etc.) is a plus.
You will directly impact customer success while driving the evolution of Arista's reliability engineering culture, moving from manual fixes to automated, scalable systems.