CrowdStrike has blamed a bug in its own test software for the mass-crash-event it caused last week.
A Wednesday update to its remediation guide added a preliminary post incident review (PIR) that offers the antivirus maker's view of how it brought down 8.5 million Windows boxes.
The explanation opens by detailing that CrowdStrike's Falcon Sensor ships with "sensor content" that steers and defines its threat-detection engine's capabilities. The software is also updated with "rapid response content" that allows it to detect and handle emerging malware and other unwanted system activity. This rapid response content is delivered to users in those channel files you've been hearing about.
The base sensor content includes what's called "template types," which are blocks of code that can be used by rapid response content to identify malicious stuff on a system. As such, rapid response content is delivered in those channel files as so-called "template instances," which CrowdStrike describes as "instantiations of a given template type." Thus, the rapid response content relies on template code defined by the base sensor content, and each piece of this response content is a template instance.
Next, each of these template instances maps to specific system behaviors for the sensor software to observe, detect or prevent.
In February 2024, CrowdStrike introduced a new "inter-process communication (IPC) template type" for rapid response content to use that the vendor designed to detect "novel attack techniques that abuse Named Pipes." The IPC template type passed testing on March 5, and a rapid response template instance was released to use it.
Three more IPC template instances were deployed between April 8 and April 24. All ran without crashing 8.5 million Windows machines – although, as we reported earlier this week, Linux machines had problems with CrowdStrike in April.
On July 19, CrowdStrike introduced two more IPC template instances. One included "problematic content data," but made it into production anyway, because of what CrowdStrike described as "a bug in the content validator."
The post doesn't detail the content validator's role; we'll assume it's supposed to do what the name suggests.
Whatever the validator does or is supposed to do, it did not prevent the release of the dodgy July 19 template instance despite it being a dud. That happened because CrowdStrike assumed that tests that passed the IPC template type delivered in March, and subsequent related IPC template instances, meant the July 19 release would be OK.
History tells us that was a very bad assumption. As we concluded in our earlier analysis of the crash, Falcon loaded and parsed the new content and became confused by the broken template instance, which "resulted in an out-of-bounds memory read triggering an exception" within CrowdStrike's Windows driver-level code, which would bring down the whole box. On reboot, it would crash all over again.
"This unexpected exception could not be gracefully handled, resulting in a Windows operating system crash," Team CrowdStrike said.
On around 8.5 million machines.
The incident report includes promises to test future rapid response content more rigorously, stagger releases, offer users more control over when to deploy it, and provide release notes.
You read that right: Release notes. Be still your beating heart.txt.
The report also includes a pledge to release a full root cause analysis once CrowdStrike has finished its investigation.
Take all the time you want: Some of us are still busy rebuilding machines you broke. ®
Source: The register