AI scrapers strain Wikipedia: 50 percent more bandwidth for multimedia requests

Wikipedia is one of the most popular websites and is well prepared for peaks in traffic. AI scrapers, however, are making that harder.

Wikipedia on a tablet

(Image: Allmy/Shutterstock.com)


The online encyclopedia Wikipedia and its associated media libraries have registered a drastic increase in bandwidth used for downloading multimedia content over the past year and attribute it to scrapers collecting training data for AI. This is according to a statement from the Wikimedia Foundation, which explains the resulting difficulties. The infrastructure is prepared for sudden spikes of interest in particular content, but continuous crawling of everything, including content that has rarely been viewed so far, creates problems of its own and puts a particular strain on the systems. The Foundation says it now needs to work on prioritizing access by people.

The demand for multimedia bandwidth over time

(Image: Chris Danis, CC BY 4.0)

The team illustrates the challenge with a diagram and some background information. It shows the bandwidth used for multimedia downloads over time. While the general level has grown noticeably since spring 2024, the diagram also shows several peaks over the course of the year. The highest falls in the period after the death of former US President Jimmy Carter, when, as the organization explains, "some people" watched the one-and-a-half-hour video of a debate between Carter and Ronald Reagan to mark the occasion. The organization is normally well prepared for such events, but in this case some users saw noticeably slow loading times for about an hour.


According to the Wikimedia Foundation, AI scrapers were responsible for these problems. Its infrastructure is set up so that particularly popular content is cached in one of several data centers; only requests for less frequently viewed content are passed on to the central data center. This reduces the overall load, even when something unforeseen happens, and human readers tend to visit the same popular content anyway. AI scrapers, by contrast, continuously retrieve as much content as possible and therefore constantly end up at the central data center. That increases the overall load and eats into the reserves needed in case interest in the encyclopedia suddenly spikes.
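The effect can be illustrated with a minimal caching sketch. The code below is not Wikimedia's software; the names, cache size, and fetch function are purely illustrative assumptions. It only shows why readers who request the same popular pages mostly hit a cache, while a scraper walking the long tail misses it on every request and lands on the central data center each time.

```python
# Minimal sketch of the two-tier pattern described above (illustrative only).
from collections import OrderedDict

CACHE_SIZE = 3  # tiny on purpose; a real cache holds the most popular content
edge_cache = OrderedDict()  # maps page title -> cached content


def fetch_from_core(title):
    # Placeholder for the expensive request to the central data center.
    return f"<article for {title}>"


def get_page(title):
    if title in edge_cache:              # cache hit: cheap, stays regional
        edge_cache.move_to_end(title)
        return edge_cache[title]
    content = fetch_from_core(title)     # cache miss: hits the core data center
    edge_cache[title] = content
    if len(edge_cache) > CACHE_SIZE:     # evict the least recently used page
        edge_cache.popitem(last=False)
    return content


# Human readers mostly request the same popular pages, so they hit the cache.
for _ in range(5):
    get_page("Jimmy Carter")

# A scraper crawling the long tail misses the cache on every single request.
for title in ("Rarely viewed page 1", "Rarely viewed page 2", "Rarely viewed page 3"):
    get_page(title)
```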

Two thirds of the most resource-intensive traffic already comes from requests that cannot be attributed to human behavior in a browser. For the team responsible for keeping the services running reliably, this now causes constant disruptions: automated requests of this kind have to be blocked again and again so that people can use Wikipedia and the other projects undisturbed. The traffic caused by AI scrapers is "unprecedented" and brings "growing risks and costs", the Foundation writes, while delivering nothing in return, such as more visibility for Wikipedia or more visits from people.
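The Foundation does not disclose how it identifies such traffic. Purely as an assumed example of the kind of heuristic that distinguishes crawler-like behavior from reading, a client requesting a very large number of distinct, rarely viewed pages in a short window could be flagged. The threshold and data structures below are hypothetical.

```python
# Hypothetical crawler heuristic (assumption for illustration, not Wikimedia's rules):
# a client that touches many distinct pages in one window looks more like a
# scraper than a reader, who revisits a handful of popular articles.
from collections import defaultdict

DISTINCT_PAGE_LIMIT = 100            # assumed threshold per time window
pages_seen = defaultdict(set)        # client address -> set of requested titles


def looks_like_scraper(client, title):
    pages_seen[client].add(title)
    return len(pages_seen[client]) > DISTINCT_PAGE_LIMIT
```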

Problems caused by the resources required to deliver content to AI scrapers are not new; in January, for example, the news site Linux Weekly News (LWN.net) reported that such accesses amounted to a veritable DDoS attack and that the site was therefore responding more slowly for everyone. The Foundation does not say who exactly operates the AI bots that are straining Wikimedia's systems; it is likely that those responsible cannot be identified. A wide variety of AI companies train their models on data that is freely available on the internet, and Wikipedia is one of the best sources for that.

(mho)


This article was originally published in German. It was translated with technical assistance and editorially reviewed before publication.