Company Overview:
We are a pioneering Infrastructure-as-a-Service (IaaS) company, focusing on delivering High-Performance Computing (HPC) solutions. Our cutting-edge data centers form the core of our operations, empowering us to offer unmatched computational resources to our global clientele. In line with our growth and the expansion of our services, we are on the lookout for an experienced and dedicated Infrastructure Operations Engineer to strengthen our team.
Position Summary:
The Infrastructure Operations Engineer plays a critical role in maintaining and optimizing the infrastructure that powers our high-performance computing environments. This role encompasses proactive system monitoring, maintenance operations, and rapid response to infrastructure incidents. The successful applicant will work closely with cross-functional teams, including Network Engineers, deployment teams, and customer support, to ensure maximum uptime and reliability of our infrastructure. Occasional travel to data centers within the US may be required to support critical maintenance, troubleshooting, or infrastructure upgrades.
Key Responsibilities:
Maintenance and Support:
Perform daily infrastructure health checks and monitoring to ensure optimal system performance
Execute firmware updates and patch management across server infrastructure
Handle RMA processes and coordinate hardware replacements with vendors and on-site personnel
Respond to customer support escalations requiring backend infrastructure access, ensuring timely resolution
Document and execute standard operating procedures to maintain operational consistency
System Administration:
Maintain server configurations and automation scripts to streamline operations
Perform routine backup verification and restoration testing to ensure data integrity
Execute change requests following established approval processes and change management protocols
Monitor system performance and resource utilization, identifying optimization opportunities
Collaboration and Communication:
Work closely with Network Engineers, deployment teams, and customer support to resolve complex issues
Coordinate with OEMs and vendors through external portals for hardware support and replacements
Escalate complex technical issues to senior leadership when necessary
Participate in post-incident reviews and contribute to continuous improvement initiatives
Documentation and Process Improvement:
Maintain thorough documentation of infrastructure configurations, procedures, and incident resolutions
Contribute to the development and refinement of operational runbooks and knowledge base articles
Identify opportunities for automation and process optimization
Safety and Compliance:
Adhere to strict data center safety protocols and operational standards
Follow security best practices and compliance requirements for infrastructure access and maintenance
Participate in regular safety training and briefings
Qualifications:
Bachelor's degree in Computer Science, Information Technology, or a related field preferred
3-5 years of experience in infrastructure operations, system administration, or a similar role in enterprise or data center environments
Strong hands-on experience with server hardware components including drives, RAM, power supplies, network interfaces, and server chassis
Proven ability to diagnose and troubleshoot complex infrastructure issues using systematic methodologies
Advanced proficiency with Linux system administration including shell scripting, process management, log analysis, and system optimization
Experience with infrastructure monitoring tools and alerting systems
Familiarity with automation tools (Ansible, Puppet, Chef, or similar) preferred
Experience with ticket management systems and incident response workflows
Knowledge of IPMI, BMC, and out-of-band management tools
Understanding of storage systems, networking fundamentals, and HPC infrastructure preferred
Strong problem-solving abilities with demonstrated capacity to work independently and make sound decisions under pressure
Excellent organizational skills with the ability to manage multiple priorities in a fast-paced environment
Effective written and verbal communication skills, with the ability to explain technical concepts to non-technical stakeholders
Availability to participate in an on-call rotation and to travel occasionally to data center locations as required
- Dice Id: cxbcsi
- Position Id: Job44586