One year after the Crowdstrike disaster: Have the right conclusions been drawn?
A year ago, a broken Crowdstrike update paralyzed millions of Windows systems. A look at what has happened in the IT world since this debacle.
(Image: CLS Digital Arts/Shutterstock.com/edited by iX)
Saturday marks the anniversary of the Crowdstrike debacle. A good opportunity to look back. What happened? On July 19, 2024, Crowdstrike rolled out a faulty update to all endpoints running its EDR system Falcon, crashing millions of Windows computers. To make matters worse, many systems were hit so badly that manual on-site intervention was required to make them usable again.
This meant that some admins literally had to walk up to thousands of Windows computers one by one, which in some cases took days. Even conservative estimates put the resulting damage at many billions of US dollars.
Investigating the causes
The main cause was Crowdstrike's poor quality assurance when distributing such critical updates to millions of systems. Crowdstrike barely tested this update in advance and pushed it to all systems in one fell swoop, although a staged rollout in several phases is industry practice. And finally, the company reacted far too late and inadequately to the crashes that began almost immediately. [Update: According to Crowdstrike, it stopped further distribution of the update after just one hour and 19 minutes and recalled it as far as possible. However, this did not restore the systems that were already caught in a reboot loop.]
In addition, the Windows system architecture allowed such a channel update to corrupt memory in the kernel. Considerable parts of the security software ran directly as part of the operating system kernel, so an error took down not just one program but the entire system. A better technical solution would have been possible, even on Windows.
Speaking of Windows: the problem hit an operating system that was itself not designed to be sufficiently resilient. The Linux kernel, for example, offers eBPF (extended Berkeley Packet Filter), an interface through which security software can access kernel resources without running in kernel mode itself. The goal is to keep potentially less reliable code out of the core of the system. Windows, by contrast, offered neither such an interface nor rules for kernel access, so all vendors, Crowdstrike included, run their drivers directly in the Windows kernel. And when loading this Crowdstrike driver, Windows crashed.
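To make the contrast concrete, here is a minimal, generic eBPF sketch, not Crowdstrike's sensor code: a tiny monitoring program in restricted C that counts process executions via a stable syscall tracepoint. Before the kernel runs it, the in-kernel verifier statically checks every memory access; a buggy program is rejected at load time instead of crashing the machine.

// Minimal eBPF sketch (generic illustration, not Crowdstrike's sensor code).
// Build: clang -O2 -g -target bpf -c exec_count.bpf.c -o exec_count.bpf.o
// Load with a libbpf-based loader or bpftool.
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

// Single-slot map holding the number of observed execve() calls.
struct {
    __uint(type, BPF_MAP_TYPE_ARRAY);
    __uint(max_entries, 1);
    __type(key, __u32);
    __type(value, __u64);
} exec_count SEC(".maps");

// Attach to the stable syscall tracepoint; runs on every process execution.
SEC("tracepoint/syscalls/sys_enter_execve")
int count_exec(void *ctx)
{
    __u32 key = 0;
    __u64 *val = bpf_map_lookup_elem(&exec_count, &key);

    if (val)
        __sync_fetch_and_add(val, 1);   // verified, bounded, crash-free
    return 0;
}

char LICENSE[] SEC("license") = "GPL";

The point of the model: if a faulty update ships a broken program, the load simply fails and the kernel keeps running, rather than the whole machine going down.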
The crash occurred so early in the boot process that users had no chance to intervene; their systems were trapped in an endless reboot loop. The obvious idea of tracking the boot process and, after a third crash at the same point, offering to start the system without the problematic driver had apparently not occurred to Redmond. This simple measure would have significantly reduced the impact of the problem.
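Reduced to its core, such a safeguard could look like the following sketch. This is a purely hypothetical illustration, not actual Windows code; the counter file, the threshold, and the driver name are all invented:

// Hypothetical sketch of the boot-loop escape logic described above.
// NOT how Windows works; path, threshold and driver name are invented.
#include <stdio.h>

#define MAX_CRASHES  3
#define COUNTER_FILE "/bootdata/crash_count"   /* persists across reboots */

static int read_crash_count(void)
{
    FILE *f = fopen(COUNTER_FILE, "r");
    int n = 0;
    if (f) {
        if (fscanf(f, "%d", &n) != 1)
            n = 0;
        fclose(f);
    }
    return n;
}

static void write_crash_count(int n)
{
    FILE *f = fopen(COUNTER_FILE, "w");
    if (f) {
        fprintf(f, "%d", n);
        fclose(f);
    }
}

int main(void)
{
    /* Record this boot attempt before loading any third-party drivers. */
    int crashes = read_crash_count();
    write_crash_count(crashes + 1);

    if (crashes >= MAX_CRASHES) {
        /* Repeated crashes at the same point: skip the suspect driver
           so the machine at least comes up and can be repaired. */
        printf("skipping suspect driver, booting in recovery mode\n");
    } else {
        printf("boot attempt %d: loading all drivers\n", crashes + 1);
        /* load_driver("falcon.sys");  hypothetical; a crash stops here */
    }

    /* Reaching this point means the boot succeeded; reset the counter. */
    write_crash_count(0);
    return 0;
}

The essential trick is that the counter is persisted before the drivers load and only reset once the boot completes, so a crash anywhere in between automatically leaves evidence for the next attempt.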
Improvements
Meanwhile, those responsible have not only vowed to do better but have already implemented concrete measures to prevent a repeat of such a worst-case failure. First and foremost Crowdstrike, which now wants to make its systems and their operation “resilient by design”. This starts with eliminating the aforementioned shortcomings in testing and update rollouts.
The aim is to find and eliminate errors before they reach customers' systems. Customers can now even decide for themselves whether they receive signature and sensor updates immediately or with a delay. Of course, this also carries a risk: a delayed sensor recognizes the very latest attack patterns correspondingly late. Crowdstrike also aims to minimize the criticized kernel usage. And finally, a “Chief Resilience Officer”, a staff position reporting directly to the CEO, is intended to strengthen resilience against future failures.
Reliability is also back on the agenda at Microsoft. As part of the Windows Resiliency Initiative, the boot process has been expanded to include Quick Machine Recovery, among other things. This should make it possible in the future to get Windows systems with boot problems up and running again remotely. Microsoft is also largely banning security software from the kernel, and the Microsoft Virus Initiative (MVI) obliges security providers to carry out more comprehensive tests and staggered rollouts of their updates.
However, these measures have also drawn criticism. Unlike the Linux world or Apple's walled garden, Microsoft is itself a provider of commercial security products. A growing number of voices in the security industry argue that Microsoft is exploiting its position to gain further advantages for its own security software. It would not be the first time that Redmond has used Windows to obstruct competitors and build further monopolies. Going forward, we will have to watch very closely where technically sensible and necessary measures end and distortion of competition begins.
Speaking of monopolies and monocultures
It is often said that the Crowdstrike fiasco was the result of a monoculture. I would disagree. Although Crowdstrike is one of the major providers in the security market, it is by no means a monopolist; SentinelOne, Trend Micro, Palo Alto and, not least, Microsoft itself with its Defender for Endpoint are valid alternatives. Windows, however, is a monoculture and therefore also a single point of failure. Yet even if Windows ran on only 30 percent of systems instead of 70 percent and shared the market with several similarly strong competitors, the Crowdstrike failure would still have caused billions in damage. Microsoft's quasi-monopoly therefore at best exacerbated the problem; it by no means caused it.
In my opinion, however, the discussion neglects the question of how a large, renowned manufacturer like Crowdstrike could be so sloppy about quality assurance in the first place. This is primarily due to the lack of consequences. Crowdstrike's share price may have plummeted in the short term, but it has long since recovered. The damage caused by the failures is not borne by those responsible, i.e., Crowdstrike or Microsoft, but by their corporate customers and, in turn, their customers, who were stranded in droves at airports, for example. The ongoing claims for damages, such as that of Delta Air Lines, will probably amount to no more than a few easily absorbed millions.
Even a year after Crowdstrike, little has changed about the systematic offloading of the risks of such sloppiness onto the companies that use the products. Security and reliability primarily generate costs for manufacturers, costs they can save without any direct loss of sales. Those who take this too far and get burned do a bit of penance, then carry on the same way a few years later. As long as we do not hold manufacturers and IT providers liable for grossly negligent failures in security and reliability, such incidents will keep happening.
After the article was published, Crowdstrike contacted us and asked us to point out the following: "Crowdstrike reacted immediately after the problem occurred and stopped and recalled the faulty updates. Furthermore, as part of the Resilient by Design initiative, we are taking a number of measures to prevent such problems in the future." We have also added a note from Crowdstrike in the text about the response time after the problems occurred.
(ju)