Position Overview
We are looking for a Senior Network Engineer specializing in Network Resiliency and High Availability (HA) to ensure our global network infrastructure remains highly available, fault-tolerant, and capable of seamless disaster recovery. In this role, you will be the primary architect and guardian of uptime. You will design, validate, and optimize redundant paths, automated failover systems, and high-availability clusters across our data centers, campus environments, WAN edge, and cloud boundaries. Your mission is simple: ensure zero packet loss for business-critical applications during unforeseen hardware or carrier failures.
Key Responsibilities
Resiliency Architecture & Chaos Engineering
· Design, implement, and maintain high-availability network topologies using physical and logical redundancy patterns (e.g., Multi-Chassis EtherChannel/MC-LAG, vPC, and VSS).
· Architect redundant Wide Area Network (WAN) transport paths utilizing dual-homed ISP connections, SD-WAN dynamic path selection, and automated failover technologies.
· Conduct controlled Network Chaos Engineering exercises (e.g., simulating fiber cuts, device power failures, and split-brain scenarios) to validate failover timers and resilience assumptions.
Dynamic Routing & Fast Convergence
· Optimize enterprise routing protocols (BGP, OSPF, EIGRP) for ultra-fast convergence, tuning features like Bidirectional Forwarding Detection (BFD), Fast Reroute (FRR), and Graceful Restart.
· Implement First Hop Redundancy Protocols (HSRP, VRRP, GLBP) to guarantee default gateway redundancy for end-user and server segments.
· Manage complex traffic engineering strategies (e.g., BGP local preference, AS-path prepending) to keep routing predictable, whether symmetric or intentionally asymmetric, during failure states.
Disaster Recovery (DR) & Business Continuity
· Lead the network engineering track for Corporate Disaster Recovery planning, including active-active and active-passive data center strategies.
· Design, configure, and maintain automated DNS-based failover (GSLB) and Anycast routing strategies to reroute user traffic away from degraded data centers or cloud regions.
· Keep comprehensive, up-to-date documentation on failover runbooks and infrastructure dependency maps.
Observability & Proactive Management
· Deploy advanced monitoring tools to track metrics like Mean Time to Detect (MTTD) and Mean Time to Repair (MTTR).
· Set up telemetry-based alerting (SNMP, gRPC/Streaming Telemetry) to identify gray failures (e.g., high interface error rates causing intermittent drops) before they cause total outages.
Experience & Education
· Experience: 5+ years in a dedicated network engineering or operations role, with a proven track record of designing 99.99% or 99.999% (Four-to-Five Nines) uptime environments.
· Education: Bachelor's degree in Computer Science, Computer Engineering, or equivalent practical experience.
Preferred Certifications
· Cisco Certified Internetwork Expert (CCIE - Enterprise Infrastructure or Data Center) or strong CCNP with equivalent experience.
· Juniper Networks Certified Specialist or Expert (JNCIS/JNCIE).
· Certified Business Continuity Professional (CBCP) or equivalent familiarity with DR frameworks is a plus.