Logging failed: Data loss at Cloudflare

An update paralyzed Cloudflare's logging systems. The faulty change was rolled back within minutes, but customers lost log data for several hours.


Examination of the systems for anomalies

(Image: Created with AI in Bing Designer by heise online / dmk)

By
  • Sven Festag

Cloudflare's cloud-based log management system did not deliver any data to customers for around three and a half hours, and around 55 percent of the logs were lost. The service provider's developers had previously made changes to the Logpush system. These proved to be faulty, so the developers rolled back to an earlier version, which corrected the error. Although the rollback took only five minutes, the flood of data that had built up in the meantime paralyzed the systems for hours.

The Logpush service reads log data from a buffer and forwards it in batches to destinations specified by the customer. The update was meant to introduce support for a new dataset. This requires configuring the Logfwdr service, a task that another system performs automatically at regular intervals. Due to a bug, Logfwdr received an empty configuration.
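The read-and-forward pattern described above can be illustrated with a minimal sketch. It is not Cloudflare's implementation: the function name `drain_in_batches`, the `batch_size` parameter and the dictionary-shaped log entries are assumptions made purely for illustration.

```python
from collections import deque
from typing import Deque, List


def drain_in_batches(buffer: Deque[dict], batch_size: int) -> List[List[dict]]:
    """Remove entries from the buffer oldest-first and group them into batches.

    Hypothetical helper: a real log pipeline would stream batches to the
    customer's destination instead of returning them all at once.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batches: List[List[dict]] = []
    while buffer:
        # Take up to batch_size entries; the last batch may be smaller.
        count = min(batch_size, len(buffer))
        batches.append([buffer.popleft() for _ in range(count)])
    return batches


# Five buffered entries with a batch size of two give batches of 2, 2 and 1.
log_buffer = deque({"event": i} for i in range(5))
batches = drain_in_batches(log_buffer, batch_size=2)
```

Batching of this kind reduces the number of transfers to the destination, but it also means that the buffer has to absorb everything that arrives between two transfers.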

According to this configuration, customers had not set up any log forwarding, and Logfwdr no longer received any log data. To prevent data loss, a fail-safe was triggered that forwarded all logs instead of only the configured ones. According to Cloudflare, the resulting amount of data exceeded the capacity of the buffers by a factor of forty. The buffers were actually supposed to be protected against such an overload, but the configuration for this protection had not been completed. The systems were only fully operational again after a restart. Microsoft also recently lost logging data.
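How an empty configuration combined with a fail-safe can overload a buffer can be sketched as follows. This is a simplified model under stated assumptions, not Cloudflare's code: `customers_to_forward`, `BoundedBuffer`, the `fail_open` flag and the customer IDs are all hypothetical.

```python
from typing import Dict, List, Set


def customers_to_forward(
    config: Dict[str, List[str]],
    all_customers: Set[str],
    fail_open: bool = True,
) -> Set[str]:
    """Return the customers whose logs should be forwarded.

    config maps a customer ID to its list of destinations (hypothetical shape).
    """
    configured = {cust for cust, dests in config.items() if dests}
    if configured:
        return configured
    # Empty configuration: failing open forwards logs for everyone,
    # failing closed forwards nothing.
    return set(all_customers) if fail_open else set()


class BoundedBuffer:
    """Buffer with a hard capacity limit that counts rejected entries."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.items: List[str] = []
        self.dropped = 0

    def offer(self, item: str) -> bool:
        """Store the item if there is room; otherwise drop it and count the drop."""
        if len(self.items) >= self.capacity:
            self.dropped += 1
            return False
        self.items.append(item)
        return True


# One customer is normally configured; an empty config selects all forty.
everyone = {f"customer-{i}" for i in range(40)}
normal = customers_to_forward({"customer-0": ["dest-a"]}, everyone)
failed_open = customers_to_forward({}, everyone)

# A buffer sized for the normal case rejects the surplus instead of collapsing.
buf = BoundedBuffer(capacity=len(normal))
accepted = [buf.offer(cust) for cust in sorted(failed_open)]
```

In this model the capacity limit turns the overload into a countable number of dropped entries. According to the article, the corresponding protection in Cloudflare's buffers existed but had not been fully configured.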

Cloudflare acknowledges that errors are inevitable and that the systems must react to them predictably and without failing. To this end, the company intends to subject the systems to overload tests in the future. There will also be warnings about misconfigurations that developers cannot overlook.

Details on the log service outage can be found in the Cloudflare blog.

(sfe)


This article was originally published in German. It was translated with technical assistance and editorially reviewed before publication.