Infrastructure Operations Engineer (Data Center) - onsite

Charlotte, NC, US • Posted 1 day ago • Updated 1 hour ago
Full Time
On-site
$80-110K

Job Details

Skills

  • Data Centers
  • Reasoning
  • Root Cause Analysis
  • Vendor Management
  • Workflow
  • Accountability
  • Linux Administration
  • Command-line Interface
  • Performance Monitoring
  • Collaboration
  • Network
  • Microsoft Windows
  • Knowledge Management
  • Inventory
  • ESD
  • Regulatory Compliance
  • Mentorship
  • Server Hardware
  • HPC
  • GPU
  • SAS
  • Serial ATA
  • Storage
  • Computer Networking
  • InfiniBand
  • Firmware
  • Linux
  • Hardware Troubleshooting
  • Log Analysis
  • Dell
  • EMC
  • Lenovo
  • IPMI
  • BMC
  • Computer Hardware
  • Schematics
  • Documentation
  • Organizational Skills
  • Management
  • Repair
  • Computer Science
  • Information Technology

Summary

Position Summary: The Infrastructure Support Engineer is a critical hands-on role responsible for the deployment, maintenance, diagnosis, and repair of advanced server infrastructure within our HPC data centers. This person is expected to be a subject matter expert on server hardware, capable of independently diagnosing complex failures and executing repairs on cutting-edge equipment. In addition to direct on-site work, this role involves coordinating and guiding remote repair efforts across multiple data center locations, serving as the technical authority during repair windows.

Key Responsibilities:
Advanced Hardware Repair and Maintenance: Perform complex, hands-on diagnosis and repair of high-end server hardware including multi-GPU compute nodes, NVMe storage arrays, high-speed networking equipment, and custom HPC configurations.
Fault Diagnosis and Root Cause Analysis: Diagnose hardware and firmware-level failures using advanced diagnostic tools, system logs, OEM utilities, and first-principles reasoning. Document findings and drive root cause analysis to prevent recurrence.
OEM and Vendor Management: Manage relationships and repair workflows with OEM partners and hardware vendors. Navigate vendor portals, escalate cases effectively, and hold vendors accountable to SLAs to minimize infrastructure downtime.
Linux System Administration: Use Linux command-line tools for in-depth system diagnostics, hardware validation, log analysis, performance monitoring, and pre/post-repair verification.
Cross-Team Collaboration: Partner with Infrastructure and Network Engineers to understand system dependencies, plan maintenance windows, and ensure repaired or newly deployed hardware integrates cleanly into production environments.
Documentation and Knowledge Management: Maintain detailed records of repairs, failure patterns, parts inventory, and procedures. Develop and refine internal runbooks and repair guides to support consistency and institutional knowledge.
Safety and Compliance: Enforce and model strict adherence to data center safety protocols, ESD procedures, and operational standards. Ensure all on-site and remote repair activities meet compliance requirements.
Mentorship: Provide technical guidance to junior technicians and on-site personnel, elevating the overall capability of the team.

Qualifications:

4+ years of hands-on experience in server hardware repair and maintenance, with at least 2 years working on enterprise or HPC-class equipment.
Deep expertise with high-end server components including multi-GPU nodes, NVMe and SAS/SATA storage, high-speed networking (InfiniBand, 100GbE+), and enterprise power infrastructure.
Proven ability to independently diagnose and resolve complex hardware and firmware failures without escalation.
Experience coordinating repairs remotely, including directing on-site technicians and managing OEM field engagements.
Strong working knowledge of Linux for hardware diagnostics, log analysis, and system validation.
Familiarity with OEM diagnostic tools and vendor support portals (Dell EMC, Lenovo XClarity, Supermicro, or similar).
Experience with IPMI, BMC, and out-of-band management tools for remote hardware monitoring and intervention.
Ability to read and interpret system schematics, technical manuals, and OEM service documentation.
Strong organizational skills and the ability to manage multiple simultaneous repair cases across different sites.
Bachelor's degree in Computer Science, Information Technology, or a related field preferred; equivalent hands-on experience will be given strong consideration.

What We Offer:
The opportunity to work on some of the most advanced computing infrastructure in the industry.
A technically demanding environment where deep expertise is recognized and rewarded.
Competitive compensation and comprehensive benefits package.
An inclusive, collaborative culture that values technical excellence and initiative.
Clear pathways for growth into senior infrastructure engineering roles.
Employers have access to artificial intelligence language tools (“AI”) that help generate and enhance job descriptions and AI may have been used to create this description. The position description has been reviewed for accuracy and Dice believes it to correctly reflect the job opportunity.
  • Dice Id: cxbcsi
  • Position Id: Job44625