Network/Infrastructure Engineer - Remote

Remote • Posted 3 hours ago • Updated 3 hours ago
Full Time
On-site
$150-170K

Job Details

Skills

  • IaaS
  • Pivotal
  • Operations Management
  • Data Centers
  • Performance Tuning
  • Network Design
  • High Performance Computing
  • Network Operations
  • Procurement
  • Network Security
  • Bill Of Materials
  • Billing
  • Compatibility Testing
  • Computer Hardware
  • Stacks Blockchain
  • Scalability
  • Knowledge Base
  • Computer Cluster Management
  • Use Cases
  • Communication
  • Storage
  • Optimization
  • Performance Analysis
  • Firmware
  • Collaboration
  • Incident Management
  • Technical Writing
  • Mentorship
  • Regulatory Compliance
  • Network Administration
  • Access Control
  • Training
  • Computer Science
  • Computer Engineering
  • Information Technology
  • Network Engineering
  • Border Gateway Protocol
  • OSPF
  • VLAN
  • Routing
  • Enterprise Networks
  • Cisco
  • Arista
  • Juniper
  • InfiniBand
  • Ethernet
  • Remote Direct Memory Access
  • Network Monitoring
  • Grafana
  • Nagios
  • IP
  • Intellectual Property
  • Subnetwork
  • IP Address Management
  • HPC
  • Job Scheduling
  • Linux Administration
  • Shell Scripting
  • Network
  • Performance Testing
  • Benchmarking
  • GPU Computing
  • Configuration Management
  • Ansible
  • Terraform
  • Analytical Skill
  • Problem Solving
  • Conflict Resolution
  • Documentation
  • Attention To Detail
  • Effective Communication
  • Management
  • Cisco Certifications
  • Computer Networking
  • Cloud Computing

Summary

Company Overview:
We are a pioneering Infrastructure-as-a-Service (IaaS) company, focusing on delivering High-Performance Computing (HPC) solutions. Our cutting-edge data centers form the core of our operations, empowering us to offer unmatched computational resources to our global clientele. In line with our growth and the expansion of our services, we are on the lookout for a skilled and innovative Network/Infrastructure Engineer to strengthen our team.

Position Summary:
The Network/Infrastructure Engineer is pivotal in designing, implementing, and optimizing the network and compute infrastructure that powers our high-performance computing environments. This role encompasses network architecture design, operational management of complex BGP environments, HPC cluster optimization, and performance benchmarking. The successful applicant will collaborate closely with NVIDIA, deployment teams, and cross-functional engineering groups to ensure our infrastructure delivers exceptional performance and reliability. Travel to data centers within the US may occasionally be required to support network deployments, troubleshooting, or performance optimization initiatives.

Key Responsibilities:
Network Design & Architecture:
Design physical and logical network topologies for high-performance computing environments supporting large-scale workloads
Maintain IP address management (IPAM) schemes ensuring efficient allocation and documentation
Create comprehensive network diagrams and technical documentation for current and future infrastructure
Collaborate with NVIDIA on Reference Architecture standards to ensure adherence to best practices and optimal configurations
Evaluate and recommend network technologies and solutions to meet evolving business requirements

Network Operations:
Configure and maintain BGP peering sessions with ISPs, partners, and internal autonomous systems
Monitor network health using observability tools, identifying and resolving performance bottlenecks
Respond to network incidents and perform advanced troubleshooting to minimize downtime
Coordinate IP block procurement and assignment, working with RIRs and transit providers
Maintain network security posture and implement changes following established protocols
Participate in on-call rotation for critical network incidents

Network Projects:
Develop detailed network BOMs (Bills of Materials) for new deployments in collaboration with deployment teams
Test and validate network configurations in lab environments prior to production deployment
Evaluate driver upgrades and perform compatibility testing across network hardware and software stacks
Design and implement network enhancements to improve performance, reliability, and scalability
Execute comprehensive network performance benchmarking using industry-standard tools and methodologies
Document project outcomes and create knowledge base articles for operational teams

HPC Cluster Management:
Optimize cluster performance and utilization through tuning of network fabric, storage, and compute resources
Test and validate deployment profiles for various HPC workloads and use cases
Configure and maintain high-speed interconnects (InfiniBand, RoCE) for low-latency communication
Work with infrastructure teams to ensure proper integration of compute, storage, and network components

Performance & Optimization:
Conduct rigorous benchmarking and performance analysis of HPC infrastructure using tools such as IOR, NCCL, and MLPerf
Test driver and firmware upgrades in HPC context, validating compatibility and performance impact
Troubleshoot complex compute node and interconnect issues affecting application performance
Document HPC-specific configurations and tuning parameters for various workload types
Identify and implement optimizations for network throughput, latency, and job completion times

Collaboration and Documentation:
Work closely with deployment engineers to ensure successful network implementation
Collaborate with infrastructure operations teams on incident response and problem resolution
Maintain comprehensive technical documentation including network diagrams, runbooks, and configuration standards
Participate in architecture review sessions and contribute to infrastructure planning
Mentor junior team members on networking concepts and HPC technologies

Safety and Compliance:
Adhere to strict data center safety protocols and operational standards during all on-site activities
Follow security best practices for network configuration and access control
Participate in regular safety training and briefings

Qualifications:
Bachelor's degree in Computer Science, Computer Engineering, Information Technology, or a related field preferred
3-5 years of experience in network engineering, with emphasis on large-scale data center or HPC environments
Expert-level knowledge of networking protocols including BGP, OSPF, VLANs, and routing fundamentals
Strong hands-on experience with enterprise network equipment from vendors such as Cisco, Arista, NVIDIA (Mellanox), or Juniper
Proficiency with high-speed interconnect technologies including InfiniBand, Ethernet RDMA (RoCE), and related protocols
Experience with network monitoring and observability tools (Prometheus, Grafana, Nagios, or similar)
Deep understanding of IP addressing, subnetting, and IP address management (IPAM)
Demonstrated experience with HPC cluster architectures and job scheduling systems (Slurm, PBS, or similar)
Strong Linux system administration skills including shell scripting and automation
Experience with network performance testing tools and benchmarking methodologies
Familiarity with NVIDIA GPU computing architectures and networking solutions preferred
Knowledge of software-defined networking (SDN) concepts and implementation
Experience with configuration management tools (Ansible, Terraform, or similar) preferred
Strong analytical and troubleshooting skills with a systematic problem-solving approach
Excellent documentation skills with attention to detail
Effective communication skills, both written and verbal, with the ability to explain complex technical concepts to diverse audiences
Self-motivated, with the ability to work independently and manage multiple projects simultaneously
Availability to participate in on-call rotation and travel occasionally to data center locations as required

Preferred Certifications:
CCNP, CCIE, or equivalent networking certifications
NVIDIA networking certifications
Relevant cloud or data center certifications
Employers have access to artificial intelligence language tools (“AI”) that help generate and enhance job descriptions and AI may have been used to create this description. The position description has been reviewed for accuracy and Dice believes it to correctly reflect the job opportunity.
  • Dice Id: cxbcsi
  • Position Id: Job44584
  • Posted 3 hours ago