Incident Response Manager, Data Center - Ashburn

  • Posted 2 days ago | Updated 8 hours ago

Overview

Full Time

Skills

Cloud Computing
Attention To Detail
Security Operations
Network Monitoring
Swift
Recovery
Optimization
Change Management
IT Service Management
Standard Operating Procedure
Reliability Engineering
Collaboration
Leadership
Incident Management
Computer Science
Information Technology
IT Infrastructure
Network Operations
Nagios
Communication
Data Centers
Problem Management
Continuous Improvement
Operational Excellence
System Monitoring
Grafana
ITIL
CompTIA
Server+
Cisco
Network
Cisco Certifications
PMP
Data Analysis
Visualization
Operational Efficiency
Management
Adaptability
Conflict Resolution
Problem Solving
Video
Recruiting

Job Details

Responsibilities

About the team

The Data Systems Infrastructure (DSI) team sits within the global technology structure and supports the company's fast growth by building and operating hyper-scale datacenters, managing the life cycle of the server fleet, providing cloud solutions, and developing various infrastructure services, making sure they are scalable and reliable.

Job Description

We are seeking a technically skilled and detail-oriented professional to serve as a front-line responder for incident detection, triage, and response across infrastructure, facilities, and security operations. The ideal candidate will possess a solid foundation in IT, infrastructure, or engineering disciplines, with experience in critical environments and the ability to analyze incidents, identify patterns, and drive long-term improvements. This role requires composed performance under pressure, data-driven thinking, and a proactive approach to continuous improvement and operational resilience.

Responsibilities

- Serve as the first responder in the IRC Operation Center, detecting and responding to events across infrastructure and facilities using tools such as Server Automation, Data Center Infrastructure Management, network monitoring, Grafana, and related systems.
- Respond promptly to events including but not limited to:
  - Environmental systems (e.g. high temperature, humidity, power fluctuations or failures)
  - IT infrastructure (e.g. server performance issues, network outages, system failures)
  - Facility and environmental alerts relevant to operations
  - External-facing services (e.g. colocation maintenance notices, service requests from CDN partners, and critical notifications)
- Conduct detailed investigations to diagnose the root cause of events, assess their impact, and determine appropriate response actions.
- Monitor and analyze detected events, accurately classify incidents based on potential or actual customer impact, and proactively communicate risks.
- Coordinate timely escalations by notifying and collaborating with relevant support teams to ensure swift incident resolution.
- Monitor incident response performance against agreed SLAs, ensuring timely alerts and notifications.
- Manage incidents calmly and efficiently, performing in-depth investigations to determine root causes and impacts, while promptly engaging and coordinating with the designated resolver teams to facilitate timely resolution.
- Draft detailed incident reports and conduct post-mortem reviews to document lessons learned.
- Generate regular reports to deliver comprehensive insights into the effectiveness of incident response and recovery processes.
- Analyze trends and patterns in events to identify opportunities for improvement and optimization.
- Own and drive the Incident, Problem, and Change Management processes in alignment with ITIL or internal ITSM frameworks.
- Develop and maintain a comprehensive library of Standard Operating Procedures (SOPs), Methods of Procedure (MOPs), runbooks, and operational guides to ensure consistency and readiness across teams.
- Lead or support continuous improvement projects aimed at enhancing incident response capabilities, operational security, system reliability, and overall infrastructure performance. Collaborate with cross-functional teams to implement engineering solutions and process optimizations.
- Provide technical and operational leadership to the incident response center team, ensuring consistent performance and adherence to best practices.

Qualifications

Minimum Qualifications

- Bachelor's degree in Computer Science, Information Technology, Engineering, or a related technical field.
- Strong technical background, with preference given to experience in Data Center Facility Operations Center (DC FOC) management. Experience in IT infrastructure, network operations, or systems monitoring is also desirable.
- Proven ability to analyze complex systems, investigate incidents, and identify root causes effectively.
- Familiarity with monitoring and alerting tools such as Grafana, Nagios, or similar platforms.
- Experience in incident and problem management processes, with the ability to drive corrective actions and coordinate cross-functional teams.
- Excellent troubleshooting skills and the ability to work calmly under pressure during critical incidents.
- Strong communication skills to draft reports, conduct reviews, and liaise with technical and non-technical stakeholders.

Preferred Qualifications

- 5 years of experience in IT environments (such as data centers or enterprise systems), combined with hands-on incident and problem management experience.
- Proactive mindset with a focus on continuous improvement and operational excellence.
- Proven ability to perform effectively under pressure and within tight time constraints to resolve issues and meet deliverables.
- Hands-on experience with ticketing systems, monitoring tools such as Grafana, server infrastructure, and data center systems.
- Working knowledge and/or certifications in one or more of the following:
  - ITIL Foundation
  - CompTIA Server+
  - Schneider Electric Data Center Certified Associate (DCCA)
  - Cisco Certified Network Associate (CCNA)
  - Project Management Professional (PMP)
  - Data Analytics and Visualization tools or methodologies
- Demonstrated experience in driving or contributing to improvement projects focused on operational efficiency, security enhancements, or infrastructure reliability.
- Ability to manage multiple tasks and projects, ensuring timely delivery and alignment with organizational goals.
- Strong adaptability and problem-solving skills in ambiguous and rapidly changing environments.
- Willingness to be on call during weekends, nights, and holidays.

Job Information

About TikTok

TikTok is the leading destination for short-form mobile video. At TikTok, our mission is to inspire creativity and bring joy. TikTok's global headquarters are in Los Angeles and Singapore, and we also have offices in New York City, London, Dublin, Paris, Berlin, Dubai, Jakarta, Seoul, and Tokyo.

Why Join Us

Inspiring creativity is at the core of TikTok's mission. Our innovative product is built to help people authentically express themselves, discover and connect - and our global, diverse teams make that possible. Together, we create value for our communities, inspire creativity and bring joy - a mission we work towards every day.

We strive to do great things with great people. We lead with curiosity, humility, and a desire to make impact in a rapidly growing tech company. Every challenge is an opportunity to learn and innovate as one team. We're resilient and embrace challenges as they come. By constantly iterating and fostering an "Always Day 1" mindset, we achieve meaningful breakthroughs for ourselves, our company, and our users. When we create and grow together, the possibilities are limitless. Join us.

Diversity & Inclusion

TikTok is committed to creating an inclusive space where employees are valued for their skills, experiences, and unique perspectives. Our platform connects people from across the globe and so does our workplace. At TikTok, our mission is to inspire creativity and bring joy. To achieve that goal, we are committed to celebrating our diverse voices and to creating an environment that reflects the many communities we reach. We are passionate about this and hope you are too.

TikTok Accommodation

TikTok is committed to providing reasonable accommodations in our recruitment processes for candidates with disabilities, pregnancy, sincerely held religious beliefs or other reasons protected by applicable laws. If you need assistance or a reasonable accommodation, please reach out to us at