Position: Data Center Operations Engineer
Location: California City, CA, USA
Duration: 12+ Months (Contract)
Interview: Video Interview
Visa: Open (As per client requirement)
Job Description:
We are seeking a Data Center Operations Engineer with strong hands-on experience supporting enterprise data center infrastructure, Linux systems, GPU server deployments, and InfiniBand networking. The ideal candidate will have expertise in installing, configuring, troubleshooting, and maintaining data center hardware and infrastructure while supporting HPC/AI environments and ensuring high availability of critical systems.
This role requires excellent troubleshooting skills across GPU cluster deployments, InfiniBand fabrics, Linux administration, networking, and data center operations. The engineer will work closely with infrastructure, operations, and engineering teams to support deployments, maintenance activities, and continuous operational improvement.
Required Skills:
- 5+ years of experience in Data Center Operations or Infrastructure Engineering.
- Strong hands-on experience with Linux system administration, troubleshooting, and performance validation.
- Experience with Linux command-line utilities and Bash/Shell scripting.
- Hands-on experience deploying and configuring GPU servers in clustered environments.
- Experience with GPU cluster bring-up, driver installation, and system-level configuration.
- Strong knowledge of InfiniBand networking, including switch configuration, subnet management, and troubleshooting.
- Experience performing end-to-end GPU testing in InfiniBand-based clusters.
- Solid understanding of networking fundamentals, including the TCP/IP and OSI models and protocols such as ARP, ICMP, TCP, UDP, SMTP, FTP, and TFTP.
- Experience installing, configuring, and troubleshooting routers, switches, and terminal servers.
- Hands-on experience with server hardware installation, including rack and stack, cabling, and firmware upgrades, and with components such as CPUs, memory, HDDs, RAID controllers, and NICs.
- Experience with fiber and copper cabling, IP networking, and SAN infrastructure.
- Experience supporting data center deployments, migrations, hardware refreshes, and expansion projects.
- Experience using monitoring and alerting tools to identify and resolve infrastructure issues.
- Experience working with ticketing systems while meeting SLA requirements.
- Strong documentation skills for operational procedures, system configurations, and technical runbooks.
- Excellent troubleshooting, communication, and organizational skills.
- Ability to work in a fast-paced production environment and participate in on-call rotations.
Preferred Skills:
- Experience supporting HPC, AI, or large-scale GPU environments.
- Experience with NVIDIA GPU platforms and Mellanox/InfiniBand technologies.
- Experience with data center monitoring solutions.
- Experience supporting large-scale data center build-outs and infrastructure refresh programs.
- Familiarity with automation or scripting for operational tasks.
Responsibilities:
- Provide operational support for data center deployments, maintenance, and repair activities.
- Install, configure, test, and maintain Linux servers and GPU infrastructure.
- Deploy, configure, and validate GPU servers and clustered environments.
- Perform InfiniBand fabric bring-up, switch configuration, subnet management, and troubleshooting.
- Install and maintain server hardware, including CPUs, memory, storage, RAID components, and network adapters.
- Configure and troubleshoot routers, switches, terminal servers, and out-of-band management devices.
- Perform daily health checks of Linux systems, networking, and infrastructure components.
- Support data center build-outs, hardware refreshes, migrations, and expansion projects.
- Coordinate with vendors for hardware installation, diagnostics, replacement, and warranty support.
- Monitor infrastructure with alerting tools and ensure timely incident resolution.
- Maintain operational documentation, technical procedures, and runbooks.
- Participate in incident response, maintenance windows, and on-call support rotations.
- Collaborate with cross-functional global teams to ensure reliable, secure, and scalable infrastructure operations.
Contact -