
Inside the 78 Minutes That Took Down Millions of Windows Machines

The CrowdStrike Crisis and Lessons in Tech Disaster Management

Shortly after midnight in New York, a digital disaster began to unfold across the globe. By the time morning arrived, millions of Windows computers had crashed, leading to significant disruptions in various sectors, including aviation, media, and healthcare.

On Friday morning, chaos erupted as businesses worldwide struggled with system crashes. In Australia, shoppers were met with the notorious Blue Screen of Death (BSOD) at self-checkout aisles. In the UK, Sky News had to suspend its broadcast after servers and PCs failed. Airports in Hong Kong and India saw check-in desks malfunction, causing delays and confusion. As New Yorkers woke up, it became clear that millions of Windows machines had succumbed to a massive tech disaster.

In the early hours of the outage, confusion reigned. How could so many Windows machines suddenly crash? “Something super weird happening right now,” wrote Australian cybersecurity expert Troy Hunt on X. On Reddit, IT admins raised alarms in a thread titled “BSOD error in latest CrowdStrike update,” which quickly garnered over 20,000 replies.

The crisis grounded flights at major US airlines and left workers at banks, hospitals, and other institutions across Europe unable to log into their systems. The root cause was soon identified: a faulty update released by cybersecurity company CrowdStrike at 12:09 AM ET on July 19th.

CrowdStrike’s Falcon security software is widely used to prevent malware, ransomware, and other cyber threats. The company regularly issues silent updates to provide the latest protections, but this particular update exposed a critical flaw in the software, leading to catastrophic consequences.

The Falcon software operates at the kernel level in Windows, giving it unrestricted access to system memory and hardware. This access makes the software highly effective at detecting threats but also capable of causing significant problems if something goes wrong. As Patrick Wardle, CEO of DoubleYou and founder of the Objective-See Foundation, explains, “When an update comes along that isn’t formatted correctly or has malformations, the driver can ingest that and blindly trust that data.”

On Friday morning, the faulty update caused a memory corruption problem, leading to widespread system crashes. “Where the crash was occurring was at an instruction where it was trying to access some memory that wasn’t valid,” Wardle says. “If you’re running in the kernel and you try to access invalid memory, it’s going to cause a fault and that’s going to cause the system to crash.”

CrowdStrike quickly identified the issue and issued a fix 78 minutes after the original update went out. IT admins worked tirelessly to reboot machines, hoping each one would download the fix before CrowdStrike’s driver crashed it again. In many cases that did not work, and support workers had to visit affected machines in person to delete the faulty content update by hand.

Investigations suggest that a dormant bug in the driver caused the problem. The driver may not have been properly validating data from the content update files, a gap that went unnoticed until Friday’s problematic update. Wardle believes the driver should be updated to include additional error checks to prevent future crashes.
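To make the idea of “additional error checks” concrete, here is a minimal sketch in Python of a parser that refuses to trust a content file until its structure has been verified. The file layout, the `CUPD` magic value, and the names `parse_update` and `InvalidUpdate` are invented for illustration; this is not CrowdStrike’s actual file format or driver code, which runs in the Windows kernel and is not public.

```python
import struct

# Hypothetical layout, invented for this illustration only:
#   4-byte magic b"CUPD", 2-byte little-endian entry count,
#   then `count` entries of two unsigned 32-bit integers each.
MAGIC = b"CUPD"
HEADER = struct.Struct("<4sH")
ENTRY = struct.Struct("<II")


class InvalidUpdate(ValueError):
    """Raised when a content file fails validation."""


def parse_update(blob: bytes) -> list[tuple[int, int]]:
    """Return the entries in `blob`, or raise InvalidUpdate.

    Every length is checked before any field is read, so a truncated
    or malformed file is rejected instead of being read out of bounds.
    """
    if len(blob) < HEADER.size:
        raise InvalidUpdate("file is shorter than the header")
    magic, count = HEADER.unpack_from(blob, 0)
    if magic != MAGIC:
        raise InvalidUpdate("unexpected magic value")
    expected = HEADER.size + count * ENTRY.size
    if len(blob) != expected:
        raise InvalidUpdate(f"expected {expected} bytes, got {len(blob)}")
    return [
        ENTRY.unpack_from(blob, HEADER.size + i * ENTRY.size)
        for i in range(count)
    ]
```

In a sketch like this, a bad file produces an error the caller can log and skip, leaving the previous, known-good content in place. Code running in the kernel has no such safety net unless the driver performs the checks itself.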

CrowdStrike’s failure to catch this issue earlier highlights the importance of rigorous pre-deployment testing. Standard practice involves gradually rolling out updates to test for major problems before they reach the entire user base. Proper testing could have identified the underlying driver problem, preventing a global tech disaster.
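A gradual rollout of the kind described here can be sketched in a few lines of Python. The stage percentages, the crash-rate threshold, and the function names below are assumptions chosen for illustration; they do not describe how CrowdStrike or any other vendor actually ships updates.

```python
import hashlib

# Assumed rollout stages, as a percentage of the fleet.
STAGES = (1, 10, 50, 100)


def bucket(host_id: str) -> int:
    """Map a host to a stable bucket from 0 to 99 using a hash of its ID."""
    digest = hashlib.sha256(host_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def in_stage(host_id: str, percent: int) -> bool:
    """True if this host should receive the update at the given stage."""
    return bucket(host_id) < percent


def next_stage(current_percent: int, crash_rate: float,
               max_crash_rate: float = 0.001):
    """Return the next stage's percentage, or None to halt the rollout.

    `crash_rate` is the fraction of updated hosts reporting a crash;
    the default threshold is an arbitrary example value.
    """
    if crash_rate > max_crash_rate:
        return None
    later = [p for p in STAGES if p > current_percent]
    return later[0] if later else current_percent
```

Because each host’s bucket is fixed, a machine that received the update at the 1 percent stage is still included at 10 percent and beyond, and a spike in crashes among that first small group stops the update before it reaches everyone else.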

Although Microsoft was not directly responsible for the disaster, Windows lets third-party drivers like CrowdStrike’s run in the kernel, where a single fault can bring down the entire OS. The widespread BSOD messages initially led many to believe it was a Microsoft outage. Moving forward, preventing another CrowdStrike situation will require collaboration between software developers and operating system providers to ensure robust error-checking mechanisms are in place.

As the digital world becomes increasingly interconnected, the importance of meticulous testing and crisis management cannot be overstated. The CrowdStrike incident serves as a stark reminder of the potential consequences of even the smallest oversight in software updates.
