Job Details
Network Support Engineer
Santa Clara, CA 95051 - Onsite/Hybrid
PAY RATE - $53/HR ON C2C
Contract
Top Skills:
Data Center & AI Cluster Networking
High-performance interconnects for GPU, HPC, and AI clusters
InfiniBand, Ultra Ethernet, RoCEv2, DCQCN
Dark Fiber / Carrier Interconnect Optimization
Hybrid DC Network Architecture & Fabric Design
Job Description/Responsibilities:
This is a hands-on network engineering position focused on the architecture, design, development, and deployment of ultra-high-speed, resilient, and scalable DC AI clusters and interconnects for GPU-accelerated data centers and compute clusters. Outstanding problem-solving abilities, a comprehensive understanding of network security protocols and standards, routing, switching, and automation, and a deep understanding of fundamental network theory are also critical to your success at NVIDIA.
Key Responsibilities
Lead the architecture, design, and deployment of global-scale DC interconnects and fabrics for HPC, AI, and GPU computing clusters.
Develop high-performance data center fabrics using InfiniBand, Ultra Ethernet, and related technologies.
Optimize carrier interconnects, intra- and inter-DC routing, and dark fiber deployments to ensure low latency and high reliability.
Partner with system, OS, GPU, and HPC teams to deliver scalable, highly available networks for extreme-performance workloads.
Implement network monitoring, telemetry, problem-solving, and continuous performance improvement processes.
Drive technology selection, vendor engagement, and lifecycle management for Data Center hardware and software.
What We're Looking For:
6-8+ years of experience building, managing, and supporting large-scale hybrid networks and developing automation pipelines with Python, Ruby, Go, or other languages used in infrastructure automation.
Subject-matter expert (SME) in networking technologies: InfiniBand, Ultra Ethernet, RoCEv2, DCQCN, TCP/UDP, IPv4/IPv6, BGP/MP-BGP, VPN, L2 switching, EVPN, VXLAN, Segment Routing, MPLS.
Experience automating network infrastructure.
Experience using an automated configuration management system (Python, Ansible, Salt, etc.).