Job Details
This role involves assisting with all projects and repairs within the data center, participating in an on-call rotation, and providing hands-on coverage during maintenance. The selected individual will resolve operational issues, analyze and redesign operations to improve workflow, manage equipment layout, and promote accident prevention. They will support day-to-day operations, including the physical layout of equipment, customer deployments, and the timely bring-up of GPU servers. They will also manage InfiniBand fabric bring-up, configuration, and subnet management on the IB switch, and document existing operational processes and equipment.
The candidate should work within a framework of monitoring tools, escalate key issues, and ensure timely service implementation. They will diagnose, troubleshoot, install, and repair software, hardware, and components. They should be proficient in installing, configuring, and troubleshooting networking equipment such as routers and switches, and have a good understanding of the OSI model and the TCP/IP protocol suite (IP, ARP, ICMP, TCP, UDP, SMTP, FTP, TFTP). Responsibilities also include configuring terminal servers for out-of-band management, handling daily issues such as health checks of servers and processes, and working closely with end users, development teams, and infrastructure teams to prioritize, resolve, and mitigate outages.
The role also involves server and network installation and maintenance, site builds and refreshes that meet current quality standards, and interaction with onsite staff and vendors for hardware replacement, delivery, and diagnostics. Additionally, the candidate will perform operational tasks associated with data center implementation, migration, deployments, cabling, and rack and stack.
As for the requirements, the candidate should have experience with cluster bring-up, driver loading, and end-to-end GPU testing in a cluster with InfiniBand. They should also have experience setting up GPU servers in a cluster and be proficient in Linux environments, including shell scripting. Strong skills in installing, configuring, and troubleshooting Linux operating systems, experience in OpenStack cloud operations, and excellent data center organization skills with meticulous attention to detail are also required. Familiarity with fiber and copper network cabling, including IP and SAN deployments, is essential, as is the ability to maintain acceptable ticket loads and incident SLAs, follow documented escalation procedures, and sync with global teams on tasks and upcoming initiatives.
Understanding and adhering to documented policies, processes, and procedures, assisting with process improvement initiatives, and documenting policies, processes, and procedures, including runbooks, are also crucial. The candidate should also be able to move 50+ pounds.
Technical Recruiter
Xchange Software Inc