Microsoft and CrowdStrike Outage Fail: Lesson for Tech Pros

Nearly a month after a faulty CrowdStrike software update crashed 8.5 million Microsoft Windows machines worldwide, leaving businesses and individual users paralyzed, the fallout continues to reverberate with threatened lawsuits, insurance payout concerns and handwringing about an overreliance on a few large vendors for major IT services.

What appeared at first to be a large-scale cybersecurity incident on a summer Friday turned out to be one of the largest IT outages ever recorded.

The incident started on July 19 when CrowdStrike pushed a software update to its customers. The cybersecurity firm later described it as an updated configuration file for the threat detection component of its Falcon endpoint detection and response platform. As a result of that update, many Windows machines were stuck in a loop or crashed with the infamous Blue Screen of Death, a.k.a. BSOD. (Linux and Mac machines were not affected.)

The outage sent support teams from Microsoft and CrowdStrike (as well as IT professionals at organizations affected by the incident) into overtime that weekend and in the days that followed as they worked to restore services, a hefty manual process.

By July 25, CrowdStrike estimated that about “99 percent of Windows sensors” had been restored. The company's preliminary assessment noted that while initial tests indicated the update was safe, one of the files distributed on July 19 “passed validation despite containing problematic content data.” (A more thorough investigation by CrowdStrike is ongoing.)

While most day-to-day operations have returned to normal, the effects of the outage are still being felt weeks later. For instance, cybersecurity risk analytics platform CyberCube published a report that concluded that insured losses from the outage would likely range from $400 million to $1.5 billion.

There is also the threat of lawsuits. Delta Air Lines, which was hit particularly hard by the outage, has loudly and publicly blamed CrowdStrike and Microsoft. In comments to CNBC, Delta CEO Ed Bastian alleged that the incident cost the airline approximately $500 million in losses and that more than 40,000 Windows-based servers within the company needed to be reset. CrowdStrike has denied the charges of negligence, and Microsoft noted that the airline declined its help.

For many, the outage holds lessons for the tech and security professionals who must deal with these and similar issues. It is also a reminder of the importance of cyber resilience: ensuring that platforms and systems can recover and function following a failure, whether malicious or accidental.

“This incident has reinforced our belief that while the security industry has traditionally focused on preventing and mitigating cyber risks, we need a more comprehensive approach: cyber resilience,” Raju Chekuri, CEO and chairman at security firm Netenrich, recently told Dice. “Cyber resilience isn't just about security—it's also about availability and performance. It's about managing the full spectrum of digital risks that can lead to business disruptions, financial losses, reputation damage and intellectual property theft.”

Summary

Building Cyber Resilience Into Recovery Plans
Beware Single Points of Failure
Beware Automation and Check Your Vendors

Building Cyber Resilience Into Recovery Plans

For years, organizations such as the National Institute of Standards and Technology (NIST) have published guidelines and documents about the importance of cyber resiliency within systems and platforms. While most of the discussions have focused on recovering systems and data following a malicious attack, experts note the lessons can also be applied to outages such as the CrowdStrike and Microsoft incident.

“In our interconnected digital world, cyber resilience strengthens us all. While today's headlines might focus on a single incident, the underlying message speaks to our collective need for a more resilient digital future,” Chekuri added. “Let's come together as an industry. Let's push forward, improve our practices, and build a digital world that's not just secure but truly resilient.”

For others, thinking about cyber-resilient systems also means considering how various platforms and processes are interconnected, and whether an organization relies too heavily on a small number of vendors to provide services and support.

“This incident exposes the dangers of a monoculture in IT environments,” Tamir Passi, senior product director at DoControl, told Dice. “Organizations heavily reliant on a single vendor's ecosystem were hit harder than those with diverse systems. It's a wake-up call to consider a more heterogeneous approach, mixing different solutions to create resilience through diversity.”

Beware Single Points of Failure

The increasing reliance on cloud-based systems and SaaS, including the automated software updates that come with these applications, also factors into the resilience conversation. While cloud-based platforms provide flexibility for organizations, they can amplify problems when there is a single point of failure, as with the flawed CrowdStrike update.

“The shift to cloud computing and consolidation of vendor products increases dependencies on fewer providers, amplifying the impact of any single point of failure,” Callie Guenther, senior manager for cyber threat research at Critical Start, told Dice. “While this can streamline operations and enhance integration, it also creates systemic risks where outages or security breaches can have widespread effects. This trend underscores the need for robust contingency planning, diverse vendor strategies and continuous monitoring to mitigate risks.”

Other experts also note that consolidation and reliance on cloud-based systems (especially since the pandemic, which increased remote work) introduce single points of failure and have other downsides that IT and security professionals must understand.

“Today, we're seeing a significant shift towards cloud-based solutions, and the consolidation of security vendors offers substantial benefits but also creates potential single points of failure,” Netenrich’s Chekuri said. “The writing is on the wall: as an industry, we must evolve strategies to balance these trends with robust, multi-faceted risk mitigation approaches.”

Beware Automation and Check Your Vendors

The CrowdStrike and Microsoft outage also underscores that many cloud-based systems and SaaS applications receive automatic updates, and tech and security professionals overseeing large, complex networks must ensure these updates' integrity before rolling them out across the enterprise.

This approach requires effective coordination among IT teams, including checks for data integrity and a cautious patch deployment plan. Resource allocation and clear user communication further complicate the process, necessitating a multifaceted approach to resolve the issue efficiently, said Jason Soroko, senior vice president of product at Sectigo.

“Proper testing practices should have included comprehensive testing in varied environments, staged rollouts to detect issues early, and automated regression testing to prevent new bugs,” Soroko told Dice. “The move to cloud services itself isn’t inherently risky, and in this case, the problem was with a corrupted agent file and not with the cloud service itself.”
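The staged rollout Soroko describes can be sketched in a few lines of Python. This is a minimal illustration, not a description of how CrowdStrike or any vendor deploys updates: the `Ring` type, the `staged_rollout` function, the `apply_update` and `is_healthy` callbacks and the failure threshold are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple


@dataclass
class Ring:
    """A group of machines that receives the update at the same stage (hypothetical)."""
    name: str
    hosts: Sequence[str]


def staged_rollout(
    rings: Sequence[Ring],
    apply_update: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    max_failure_rate: float = 0.0,
) -> Tuple[List[str], Optional[str]]:
    """Update one ring at a time and stop as soon as a ring looks unhealthy.

    Returns the names of the rings that completed cleanly and the name of
    the ring where the rollout halted (None if every ring succeeded).
    """
    completed: List[str] = []
    for ring in rings:
        failures = 0
        for host in ring.hosts:
            apply_update(host)
            if not is_healthy(host):
                failures += 1
        # An empty ring cannot fail, so treat its failure rate as zero.
        rate = failures / len(ring.hosts) if ring.hosts else 0.0
        if rate > max_failure_rate:
            # Halt here: later (larger) rings never receive the update.
            return completed, ring.name
        completed.append(ring.name)
    return completed, None
```

In use, the first ring would typically be a handful of test machines and the last the broad production fleet, so a faulty update is caught while the number of affected machines is still small. A real deployment would also need a rollback step and a waiting period between rings, both of which are left out of this sketch.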

With an increasing emphasis on automating more and more processes, tech and security teams need to understand how to properly implement automation tools to avoid costly mistakes.

“While automation can enhance efficiency, it can also amplify mistakes if not properly implemented. The key is finding the right balance between automated processes and human oversight,” DoControl’s Passi said. “Importantly, always stage changes in your environment first. Test thoroughly before rolling out updates to your production systems, even if they come from trusted vendors.”

Passi added that this approach requires tech professionals to better manage their vendor relationships by understanding their processes. “Know your vendors inside and out—their development practices, quality control measures, incident response plans, and how they handle your data. Don't be afraid to ask tough questions and demand transparency,” he noted.

Dice Staff
