Cloudflare Outage: A Rights Management Error with Far-Reaching Consequences

The severe Cloudflare outage was caused not by an attack, but by an internal error. Cloudflare has now explained this in detail.

Cloudflare logo on a building

(Image: Tada Images / Shutterstock.com)

Tuesday's far-reaching Cloudflare outage was triggered by a change to the access rights for an internal database. As a consequence, a file accumulated too many entries and grew too large. This oversized file was then distributed across the entire Cloudflare network, where it caused the software that depends on it to crash. The company explained this in a detailed blog post. According to the post, the file is central to the part of the system that detects and rejects automated (“bot”) requests. This is why not every service that uses Cloudflare was affected: sites that do not rely on the affected bot-detection software remained accessible.
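To illustrate the failure mode in general terms, the following is a minimal sketch of a loader that enforces an entry limit on a distributed file and keeps the last valid version instead of crashing. It is not Cloudflare's code. The names `MAX_FEATURES`, `parse_feature_file` and `FeatureStore`, the one-entry-per-line format and the limit value are all assumptions made for this example.

```python
from dataclasses import dataclass, field

# Hypothetical cap on entries, chosen for illustration only.
MAX_FEATURES = 200


class FeatureFileTooLarge(Exception):
    """Raised when a feature file has more entries than the loader allows."""


def parse_feature_file(lines: list[str], limit: int = MAX_FEATURES) -> list[str]:
    """Parse one entry per line (assumed format) and reject oversized files."""
    features = [line.strip() for line in lines if line.strip()]
    if len(features) > limit:
        raise FeatureFileTooLarge(f"{len(features)} entries, limit is {limit}")
    return features


@dataclass
class FeatureStore:
    """Holds the last feature list that passed validation."""

    current: list[str] = field(default_factory=list)

    def update(self, lines: list[str], limit: int = MAX_FEATURES) -> bool:
        """Apply a new file if it is valid; otherwise keep the previous one."""
        try:
            self.current = parse_feature_file(lines, limit)
            return True
        except FeatureFileTooLarge:
            # Reject the bad file but keep serving with the old data.
            return False
```

In this sketch an oversized file is rejected and the previously loaded entries stay in use, whereas an unhandled error at the same point would take the dependent process down.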

As Cloudflare explains, the paralyzed system assigns a score to each request, using methods such as machine learning. This “bot score” indicates how likely it is that a request is automated. It is calculated, among other things, from precisely the file that suddenly became too large, which is supposed to contain request characteristics that help with the assessment. As a result, the scores were calculated incorrectly and flagged far too many requests as automated. Customers who had chosen to block such requests suddenly became much harder to reach; others were not affected.
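Why only some customers were hit can be shown with a small sketch. It assumes, as the article describes it, that the score expresses the probability of an automated request on a scale from 0 to 1. The function name `should_block`, the `blocking_enabled` flag and the threshold value are hypothetical and do not reflect Cloudflare's actual rule engine.

```python
def should_block(
    bot_probability: float,
    blocking_enabled: bool,
    threshold: float = 0.9,  # hypothetical cut-off, for illustration only
) -> bool:
    """Decide whether a request is rejected based on its bot score.

    Only customers who enabled blocking act on the score at all.
    """
    if not 0.0 <= bot_probability <= 1.0:
        raise ValueError("bot_probability must be between 0 and 1")
    return blocking_enabled and bot_probability >= threshold


# A human visitor who wrongly receives a high score:
wrong_score = 0.95
blocked_with_rule = should_block(wrong_score, blocking_enabled=True)
blocked_without_rule = should_block(wrong_score, blocking_enabled=False)
```

With a wrongly inflated score, the customer with a blocking rule turns the visitor away, while the customer without such a rule serves the page as usual.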

According to the blog post, which was published personally by Cloudflare CEO Matthew Prince, troubleshooting was complicated by an unfortunate coincidence: the status page, which is hosted completely independently of Cloudflare's own technology, went offline at approximately the same time. The team therefore initially suspected that a massive attack on Cloudflare was behind the outages. Prince himself speculated in an internal chat that one of the largest current botnets might be showing off its strength. In fact, the two events were entirely unrelated. Microsoft recently disclosed a record attack on its infrastructure.

According to the description, the issues began on Tuesday at 12:28 PM CET and were investigated for an hour and a half. Shortly after 2:30 PM, the team was able to focus on the actual cause, and an hour later it stopped the excessively large file from being overwritten. Minutes later, the issue was resolved internally and a correct file was distributed to the systems. The problem was finally resolved shortly after 6 PM CET, almost six hours after it began. Prince apologized for the outage. Given Cloudflare's importance to the internet, any outage is “unacceptable”: “We know we disappointed you today.”

Cloudflare's infrastructure is intended to make websites and applications faster, more secure, and more stable. The US company is particularly known for its DDoS protection, which shields websites from mass requests that could otherwise cripple them. Because numerous services rely on it, Tuesday's outage had far-reaching consequences for a wide variety of online offerings. Among the services that were unavailable were the microblogging services X and Truth Social, and AI services such as ChatGPT and Perplexity also stopped working. The error also affected major platforms such as ikea.com and even individual media outlets.

(mho)

This article was originally published in German. It was translated with technical assistance and editorially reviewed before publication.