The Crowdstrike fiasco: root causes and initial lessons learned
After perhaps the biggest outage in IT history, Jürgen Schmidt analyzes what exactly went wrong - and, above all, what can be done better in the future.
A faulty update for Crowdstrike's agent software caused around 8.5 million Windows PCs to crash worldwide - many of them in corporate production environments. The error was so persistent that a restart did not help: Windows kept getting stuck at the same point. The problem is already considered by many to be the biggest failure in IT history.
After the initial horror and the subsequent efforts to get the systems up and running again, the stocktaking slowly begins: What were the causes? Why did it have such a devastating effect? Can it happen again? And what can companies or the manufacturers involved do to prevent this from happening again? That's what we'll be looking at below; at the very end, we'll also provide a few practical tips that all company admins should take to heart.
The update problem
It is important to understand that this was not a classic software update. Admins could have tested such an update first and then rolled it out in stages to avoid widespread failures. Instead, Crowdstrike delivered a new detection pattern in response to new threats, which was intended to detect and prevent certain tricks with named pipes. According to unconfirmed rumors, the trigger was a new feature in the Cobalt Strike attack framework, which criminals also like to use. Its manufacturer had introduced new functions based on named pipes shortly before the fatal Crowdstrike update.
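The staged rollout mentioned above for classic software updates can be sketched in a few lines. This is a minimal illustration, not Crowdstrike's or any vendor's mechanism: the ring names, the ring sizes and the hash-based assignment are all assumptions chosen for the example.

```python
import hashlib

# Hypothetical ring layout: cumulative share of hosts that have the update
# once the given ring has been released.
RINGS = [("canary", 0.01), ("early", 0.10), ("broad", 1.00)]


def host_bucket(hostname):
    """Map a hostname to a stable value in [0, 1) via a hash."""
    digest = hashlib.sha256(hostname.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") / 2**32


def ring_for(hostname):
    """Return the first ring whose cumulative share covers this host."""
    bucket = host_bucket(hostname)
    for name, cumulative_share in RINGS:
        if bucket < cumulative_share:
            return name
    return RINGS[-1][0]


def hosts_to_update(hostnames, released_ring):
    """Hosts that should receive the update once released_ring is released."""
    order = [name for name, _ in RINGS]
    allowed = set(order[: order.index(released_ring) + 1])
    return [host for host in hostnames if ring_for(host) in allowed]
```

Because the assignment depends only on the hostname, a machine always lands in the same ring, and a faulty update that crashes the canary ring never reaches the rest of the fleet.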
Crowdstrike does offer an option to pin the software itself to an older version so that others can test new functions first. However, this only affects the actual software. The signatures delivered in so-called channel files are always installed in their latest version. These signature updates appear daily, sometimes even hourly, and they are delivered and activated directly without any further checkpoint. This is due to the urgency of reacting to threats as quickly as possible and is therefore common practice in the security industry. In this respect, users are largely at the mercy of the manufacturer's diligence. It is important to note that these updates can also contain commands such as "Delete this file", which the Crowdstrike agent executes. (This perhaps makes the BSI's warning against the use of Kaspersky software understandable in retrospect.)
Windows as an "open" system
In addition, antivirus software, and nowadays EDR software even more so, is integrated very deeply into the system in order to detect and then block malicious activity. Almost every EDR installs drivers with kernel rights under Windows. This often leads to complex and error-prone software being executed in kernel mode. And if a kernel component crashes, for example due to an error in its code, the entire system comes to a standstill. Incidentally, Dave Plummer, the author of the Windows Task Manager, explains what kernel mode and its dangers are all about in an excellent video on X.
In addition to the risk of crashing, complex kernel drivers naturally also entail increased security risks. And the worst thing is: this often even affects users who do not use these drivers themselves. For example, the attack technique known as "bring your own vulnerable driver" has become established, in which attackers install a kernel driver that is known to be vulnerable solely in order to use it to attack the security software.
Apple, for example, shows that this can be done differently by strictly preventing such kernel drivers. Instead, Apple provides third-party security software with special interfaces through which it can perform its tasks without having to intervene deeply in the system itself. Accordingly, security expert Kevin Beaumont, for example, is calling for Microsoft to introduce hard guard rails in Windows in order to limit the dangerous activities of the manufacturers.
The problem with this is that such restrictions further strengthen Microsoft's monopoly position and drastically reduce the opportunities for independent security providers to distinguish themselves with innovative products. Critics also fear that Microsoft would take advantage of this to further expand its already dominant position in the security market and let the competition starve on its outstretched API arm. In fact, Microsoft took its first steps in this direction over ten years ago, but was met with harsh criticism from competitors in the security camp and was ultimately rebuffed by the competition authorities.
However, there are certainly measures that Microsoft can take to improve the security and resilience of Windows without unfairly slowing down the competition. For example, Windows could monitor its own boot process and, if it detects that it repeatedly gets stuck at the same point, offer the user the option of booting without the problematic driver in order to resolve the problem. This would not be witchcraft, but solid software engineering and would have ensured that those affected could get the system up and running again with a simple keystroke. Windows can do this in principle, but allows exceptions for so-called boot-start drivers. Apparently Crowdstrike has marked its driver as such, on the assumption that it is better not to start Windows at all than without Crowdstrike's protection. Microsoft could take a more rigid approach in favor of resilience, but would then of course have to do the same for its own Defender drivers.
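The boot monitoring described above boils down to a small decision rule. The following is only a sketch under stated assumptions: the BootAttempt record, the stage label and the threshold of three attempts are hypothetical and do not describe how Windows actually tracks its boot process.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BootAttempt:
    completed: bool   # did this boot reach a running system?
    last_stage: str   # hypothetical label, e.g. the driver being loaded when the boot stopped


def suspect_stage(history, threshold=3) -> Optional[str]:
    """Return the stage at which the last `threshold` boots all failed, else None."""
    recent = history[-threshold:]
    if len(recent) < threshold:
        return None  # not enough evidence yet
    if any(attempt.completed for attempt in recent):
        return None  # at least one recent boot succeeded
    stages = {attempt.last_stage for attempt in recent}
    # Only report a suspect if every failed boot stopped at the same point.
    return stages.pop() if len(stages) == 1 else None
```

If such a function returns a stage, the boot loader could offer to skip the component associated with it on the next attempt; whether that is acceptable for a boot-start driver is exactly the policy question raised above.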
Microsoft could also offer better interfaces to mitigate the problem of faulty drivers. There are certainly security solutions that deliberately do without their own kernel drivers. Capsule8, for example, uses the kernel interface eBPF to gain deep insights into the monitored Linux systems. But eBPF for Windows is still in its infancy, says security researcher Matt Suiche, explaining why this is currently not an option for Windows software. As Microsoft does not offer anything comparable, it is no wonder that practically all known security solutions for Windows rely on their own kernel drivers.
Rust FTW?
In the context of the Crowdstrike fiasco, there have also been increasing calls for a switch to "secure programming languages" such as Rust. This is indeed an important task that will hopefully now be given even higher priority. This is because Rust largely eliminates memory management errors, which have accounted for the majority of all security-relevant programming errors for decades. But even that only covers part of the problem. There is no such thing as a programming language in which it is impossible to make mistakes that lead to a crash. This is all the more true when a driver with unlimited privileges executes commands that were hastily stitched together as an urgent update.
Ultimately, the key to more resilience is to be seen in better quality assurance in general. Nevertheless, Microsoft should not only migrate its own drivers from C/C++ to Rust as quickly as possible, but also push ahead with projects such as windows-drivers-rs that enable others to do the same. However, this should not be expected to improve the situation quickly.
Tips for admins
After all the things that Microsoft and vendors like Crowdstrike could and should do, there are also at least two lessons that every administrator should take to heart: one quite specific and one rather general. Most of the cases in which it took a disproportionately long time to resume operations had to do with Bitlocker. This was because the workaround for breaking the forced restart loop required deleting the update files in the Windows folder (more precisely: windows\system32\drivers\crowdstrike\c-00000291*).
However, if the entire drive is encrypted with Bitlocker, these files cannot be accessed "from the outside". This is of course exactly the purpose of encryption. It ensures that a thief or finder of a company laptop, for example, cannot gain access to important company data. To gain this access independently of the installed Windows, you must first enter the recovery key. This also applies to Windows' own recovery mode, in which the problematic files can be deleted.
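Once the drive has been unlocked, the deletion step itself is simple. The sketch below is an illustration, not an official remediation tool: the function name and the dry_run switch are assumptions, and only the file pattern comes from the path quoted above.

```python
from pathlib import Path


def remove_faulty_channel_files(driver_dir, pattern="c-00000291*", dry_run=True):
    """List, and unless dry_run is set delete, the files matching pattern.

    Whether the match is case-sensitive depends on the file system.
    """
    matches = sorted(p for p in Path(driver_dir).glob(pattern) if p.is_file())
    if not dry_run:
        for path in matches:
            path.unlink()
    return matches
```

Called with the crowdstrike driver directory named above, the default dry run only reports which files would be removed; passing dry_run=False actually deletes them.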
Anyone who had neglected key management paid a high price when recovering from the Crowdstrike failure, and that price was multiplied by the number of managed systems. Many production failures could be traced back to problems with obtaining and entering the required recovery keys. So the first lesson for admins: take this as an opportunity to think about where and how you store your Bitlocker recovery keys and how they should be accessed in various emergency scenarios.
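A first, very basic check of key management is to compare the list of managed systems with the keys that are actually on record. The following sketch assumes that both are available as plain Python data; the function name, the hostnames and the dictionary layout are hypothetical, and a real inventory would come from your own management tooling.

```python
def systems_without_recovery_key(managed_systems, escrowed_keys):
    """Return the managed systems for which no recovery key is on record.

    escrowed_keys maps a system name to its stored key; a missing or empty
    entry counts as "no key".
    """
    return sorted(name for name in managed_systems if not escrowed_keys.get(name))
```

Every name this returns is a system that could not be recovered quickly in a scenario like the one described here. The check says nothing about whether the stored keys are still reachable when the central infrastructure itself is down, which is worth testing separately.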
The second tip has to do with emergency scenarios in a more general context: Many problems were due to the fact that nobody in the company had previously thought about how to react to large-scale IT failures. As a result, all measures had to be designed and tested under extreme stress and time pressure. Don't let it get that far. Work out in advance, for a range of realistic emergency scenarios, what options you have for responding and, above all, what you need in order to do so. And then practise this if possible. This is the only way to find the problems that will cause your great concept to fail in practice.
(ju)