Fighting AI training: More and more news sites are blocking the Wayback Machine

Fearing that the well-known archiving service will be used for AI training, more and more news sites are blocking the Wayback Machine – not just in the USA.

In the fight against the unauthorized use of content for training AI models, the Internet Archive is increasingly caught in the crossfire, and the Wayback Machine is in danger of becoming collateral damage. An analysis by the Nieman Journalism Lab at Harvard University suggests as much: according to it, more and more news sites are blocking the archive's crawlers. More than 340 local news sites now restrict "the Internet Archive's ability to access and preserve their stories," and national and international media such as the New York Times are doing the same. European media are also on the list, though currently none from Germany.

For the analysis, a Nieman Lab author evaluated an extensive database of robots.txt files from news sites around the world. A site was counted if it blocked one or more crawlers that originate, or appear to originate, from the Internet Archive. According to the results, the archiving service is mainly blocked by regional newspapers belonging to one of five major US media groups. Between January and May alone, the number of sites blocking the Wayback Machine increased by more than 50 percent. In total, the sample now contains 382 such sites, the vast majority of them local and regional newspapers.
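
A check of this kind can be sketched in a few lines of Python. This is only an illustration, not the Nieman Lab's actual method. The user-agent tokens ia_archiver and archive.org_bot are assumed here as names associated with the Internet Archive, the sample robots.txt is invented, and the helper blocked_archive_agents is hypothetical. A crawler is treated as blocked simply when it may not fetch the site root.

```python
from urllib.robotparser import RobotFileParser

# Assumption: user-agent tokens commonly associated with the Internet Archive.
# The list used in the actual analysis is not given in the article.
ARCHIVE_AGENTS = ("ia_archiver", "archive.org_bot")


def blocked_archive_agents(robots_txt, agents=ARCHIVE_AGENTS):
    """Return the agents that this robots.txt bars from the site root."""
    parser = RobotFileParser()
    # Parse the file contents directly, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, "/")]


# Invented example file: one archive crawler is shut out, everyone else is allowed.
SAMPLE = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Allow: /
"""

if __name__ == "__main__":
    print(blocked_archive_agents(SAMPLE))
```

Under these assumptions, a site would be counted as blocking the archive whenever the returned list is not empty. A real survey would also have to decide how to treat partial blocks, such as a robots.txt that only excludes individual directories.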

In January, the authors had compiled statements from major media outlets explaining that they block the Internet Archive in an attempt to keep their content out of AI models. For months, critics have complained that AI companies scrape all sorts of content from the internet for training and do not adhere to conventions such as robots.txt. The file lets site operators declare that AI crawlers should stay away, but it only works if AI companies honor it, and they do not. Even if they did, the archived copies on the Wayback Machine would remain a detour that more and more site operators now want to close off. Reddit, for example, has already done so.

While the Wayback Machine is only one of the Internet Archive's services, it is probably the best known. Web pages have been archived there for decades, and the site has long since become by far the most extensive source for tracing the development of the internet. However, the approach has always involved conflict, as opposing interests have repeatedly clashed. As early as 2017, the Internet Archive declared that it would no longer comply with robots.txt directives in every case. With the latest blocks, however, it does appear to be complying, as can be seen, for example, with the sites of El País or Le Monde.

(mho)

This article was originally published in German. It was translated with technical assistance and editorially reviewed before publication.