On Friday, July 19, 2024, an untested update to the CrowdStrike Falcon software was pushed to all customers and caused Windows-based computers to crash with the infamous "Blue Screen of Death." This can happen when a null pointer is dereferenced in kernel space, which is more privileged than user space on the operating system. According to one account on YouTube, the update included a file that contained zeroes where valid memory addresses should have been. This software is written in C++, a language that requires programmers to manage memory explicitly. Java and most other modern languages handle memory for the programmer implicitly at run time, but they are not designed for low-level applications. Rust is a newer option that enforces memory safety at compile time.
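To make that concrete, here is a small user-space C++ sketch (the struct and names are made up for illustration, not CrowdStrike's actual code) of what happens when a data file hands the program a zeroed-out pointer and the code dereferences it without checking:

```cpp
#include <cstdint>
#include <cstdio>

// Hypothetical record parsed out of a content file. In the reported
// incident the file allegedly contained zeroes where valid data was
// expected; the field names here are purely illustrative.
struct ContentRecord {
    const char* rule_name;   // pointer/offset loaded from the file
    uint32_t    rule_flags;
};

// Unsafe: blindly trusts whatever pointer the file supplied.
void process_unchecked(const ContentRecord& rec) {
    char first = *rec.rule_name;   // undefined behavior if rule_name is null
    std::printf("rule starts with '%c'\n", first);
}

// Safer: validate before dereferencing, and fail the single update
// instead of the whole program.
bool process_checked(const ContentRecord& rec) {
    if (rec.rule_name == nullptr) {
        std::fprintf(stderr, "malformed record: null rule name, skipping\n");
        return false;
    }
    std::printf("loading rule: %s\n", rec.rule_name);
    return true;
}

int main() {
    ContentRecord bad{nullptr, 0};  // simulates the zero-filled file
    process_checked(bad);           // handled gracefully
    process_unchecked(bad);         // segfault here
    return 0;
}
```

In user space this is just one process crashing. The same mistake inside a kernel driver takes down the entire machine, which is why every affected Windows host showed the blue screen.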
How it blew up the world
This update was in a content file, so it was not due to any code change. There was likely some breakdown in the QA or deployment process; it is possible a file was misnamed or a naming collision occurred somewhere along the line. CrowdStrike allows IT admins to set rings for deployment, so that an update first goes only to a test system. This is usually known as a canary deployment. But these are Windows shops, and some organizations run only a handful of systems; they may not have large IT departments willing or able to do rolling deployments on their own. This is one flaw in the system: it forces customers to mount their own defense against bugs.
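For anyone unfamiliar with rings, here is a minimal sketch of the idea (hypothetical names and policy, not CrowdStrike's actual mechanism): every update carries the widest ring it has been released to, and a host installs it only if its own ring is covered.

```cpp
#include <iostream>
#include <string>

// Hypothetical deployment rings, ordered from smallest blast radius
// to full fleet. The names are illustrative only.
enum class Ring { Canary = 0, EarlyAdopter = 1, General = 2 };

struct Update {
    std::string id;
    Ring        released_to;  // widest ring the vendor has released it to
};

struct Host {
    std::string name;
    Ring        ring;         // ring the IT admin assigned this host to
};

// A host should install an update only if the update has been released
// at least as far as the host's own ring. The point is to apply this
// check to every update type, content files included.
bool should_install(const Host& host, const Update& update) {
    return static_cast<int>(update.released_to) >= static_cast<int>(host.ring);
}

int main() {
    Update content_file{"content-update-42", Ring::Canary};  // canaries only so far
    Host   test_box{"lab-vm-01", Ring::Canary};
    Host   production_server{"records-01", Ring::General};

    std::cout << test_box.name << ": "
              << (should_install(test_box, content_file) ? "install" : "hold") << "\n";
    std::cout << production_server.name << ": "
              << (should_install(production_server, content_file) ? "install" : "hold") << "\n";
    return 0;
}
```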
As it turns out, content updates bypass the configurable deployment ring system for CrowdStrike Falcon; they just get deployed everywhere, all at once. CrowdStrike deemed these updates low-risk because they happen on a daily basis, but this one exposed an existing problem that had never been tested.
How to avoid critical deployment bugs
- First, the ring system should be enabled for all updates, not just code changes. At least give IT departments a chance to test before rolling out to every system.
- CrowdStrike should run their own canary deployments: test updates in the field in a way that reaches only a select group of customers, not everyone at once.
- Testing: it goes without saying that the test matrix needs to be kept up to date with what is actually running in the field. The McAfee update that caused a similar outage 14 years ago had the same problem.
- Automated testing integrated into the build process.
- Without knowing what their CI/CD process looks like, they should have automated acceptance tests in the pipeline that creates the file that goes out to customers (a rough sketch of such a check follows this list).
- Maybe their pipeline was only triggered on code changes and never ran on content files. Are the content files even kept in source control?
- Run acceptance tests periodically, instead of only on code changes. If you’re releasing files once a day, you might want to do this 4 times a day or every hour.
- Add manual regression testing (most companies are already doing this anyway, but it doesn't cover all the bases).
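Without knowing CrowdStrike's file format or tooling, here is a rough sketch of the kind of automated acceptance check mentioned above: a small validator the pipeline could run against every content file before it ships. The file layout (a magic header followed by a payload) and all of the names are assumptions for illustration; a real check would parse the file exactly the way the sensor in the field does.

```cpp
#include <cstdint>
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical acceptance check run in CI against every content file
// before release. The format assumed here is invented: a 4-byte magic
// header followed by a payload that must not be entirely zero.
bool validate_content_file(const std::string& path, std::string& error) {
    std::ifstream in(path, std::ios::binary);
    if (!in) { error = "cannot open file"; return false; }

    std::vector<uint8_t> bytes((std::istreambuf_iterator<char>(in)),
                               std::istreambuf_iterator<char>());
    if (bytes.size() < 8) { error = "file too small to be valid"; return false; }

    // Catch the failure mode reported in this incident: a zero-filled payload.
    bool all_zero = true;
    for (size_t i = 4; i < bytes.size(); ++i) {
        if (bytes[i] != 0) { all_zero = false; break; }
    }
    if (all_zero) { error = "payload is entirely zero-filled"; return false; }

    // Check the (hypothetical) magic header so misnamed or colliding files
    // from the wrong pipeline stage are caught before they ship.
    const uint8_t expected_magic[4] = {'C', 'F', 'G', '1'};
    for (int i = 0; i < 4; ++i) {
        if (bytes[i] != expected_magic[i]) { error = "bad magic header"; return false; }
    }
    return true;
}

int main(int argc, char** argv) {
    if (argc < 2) {
        std::cerr << "usage: validate <content-file>\n";
        return 2;
    }
    std::string error;
    if (!validate_content_file(argv[1], error)) {
        std::cerr << argv[1] << ": REJECTED: " << error << "\n";
        return 1;   // non-zero exit fails the CI pipeline stage
    }
    std::cout << argv[1] << ": OK\n";
    return 0;
}
```

Run as a required pipeline step, a non-zero exit blocks the release. That is the kind of cheap, automated gate that could catch a zero-filled or misnamed file before it ever reaches a customer.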
If a company that creates software this mission-critical has this issue, then this kind of QA problem must be widespread in the industry. There's always some way to improve your process. You don't have to rely on superstar engineers to catch problems; organizations just need to set up multiple layers of defense to avoid catastrophic failures in field deployments.