ROLE: Network SRE (Site Reliability Engineer – Networking)
Location: Phoenix
Experience: 10–16+ years in Network Engineering with SRE / Automation focus
Role Summary
We are seeking a Network SRE to ensure the reliability, scalability, and performance of cloud and hybrid network platforms.
This role applies SRE principles to networking by shifting from manual network operations to automated, observable, and resilient network services.
The ideal candidate is a network engineer who thinks like a software engineer and SRE.
Key Responsibilities
Network Reliability Engineering
Define SLIs, SLOs, and Error Budgets for network services.
Design networks for:
- High availability
- Fault tolerance
- Low latency
- Predictable performance
Improve network reliability while reducing operational toil.
Cloud & Hybrid Networking
Architect and operate AWS networking:
- VPCs, Subnets, Route Tables
- Transit Gateway
- NAT, IGW
- PrivateLink, VPC Endpoints
Design hybrid connectivity.
Support multi-account and multi-region architectures.
Network Observability & Monitoring
Build deep network observability using:
- VPC Flow Logs
- CloudWatch
- Datadog
- Prometheus / Grafana
Analyze packet loss, latency, and throughput.
Implement proactive alerting based on SLOs.
Correlate network signals with application performance.
Automation & Infrastructure as Code
Automate network provisioning and changes using:
- Terraform / CloudFormation
Implement CI/CD for network changes.
Reduce manual configuration and human error.
Version-control network definitions.
Incident Response & Troubleshooting
Lead network-related incident response.
Perform deep root-cause analysis for:
- Packet drops
- Routing issues
- DNS failures
- Load balancer degradation
Participate in the on-call rotation and post-incident reviews.
Drive permanent fixes rather than workarounds.
Security & Traffic Management
Design and enforce:
- Network segmentation
- Zero-Trust principles
- Firewall rules (Security Groups, NACLs)
Implement secure ingress/egress patterns.
Support DDoS protection (AWS Shield, WAF).
Work with Security teams on audits and remediation.
Performance & Capacity Planning
Conduct traffic modeling and capacity forecasting.
Tune load balancers (ALB, NLB).
Optimize routing and failover strategies.
Validate resilience through failure testing.
Collaboration & Enablement
Partner with:
- Cloud Platform teams
- Application SREs
- Security & Infra teams
Enable application teams with network best practices.
Produce architecture diagrams, runbooks, and SOPs.
Influence platform design decisions.
Required Skills & Qualifications
Must-Have:
- Strong networking fundamentals (TCP/IP, DNS, BGP, routing)
- AWS networking expertise
- SRE concepts & practices
- Network observability & monitoring
- Infrastructure as Code
- Production incident handling experience