Used by AI crawlers: Reddit largely locks out the Wayback Machine

Only AI companies that pay for access are allowed to train their models on data from Reddit. Some have allegedly tapped the Internet Archive instead.

Reddit logo on smartphone

(Image: KateV28/Shutterstock.com)


Reddit has begun blocking the Internet Archive's Wayback Machine, which in the future will only be allowed to archive screenshots of the popular social news aggregator's homepage. A Reddit spokesperson announced this to various US media outlets, explaining the move as a measure against unauthorized AI crawlers. The Verge summarizes the criticism: these crawlers had used the Internet Archive to obtain Reddit content that they were not allowed to access on the site itself. Reddit has signed contracts with some AI companies that allow them to train their models on user-generated content. Others are prohibited from doing so.

The company spokesperson did not say which AI companies had taken the detour via the Wayback Machine to obtain the coveted Reddit content as AI training material, Ars Technica adds. However, he did explain that the Internet Archive could take steps to regain access. This would involve a better defense against crawlers, but also more respect for the rights of Reddit users. For example, the Wayback Machine has sometimes been used to view posts that were deleted on Reddit but are still held by the Internet Archive. The Internet Archive has not commented on the announcement itself and merely referred to its long collaboration with Reddit, in which the topic of AI crawlers has also been discussed.


For the various AI companies, Reddit is a particularly valuable source of human-written content for training their models. The portal started charging for permission to use this data a year and a half ago. At the same time, it blocked the crawlers of search engines and AI companies that do not pay. However, Reddit has long claimed that these crawlers do not always respect the block. At the beginning of June, the portal filed a lawsuit against the AI start-up Anthropic, which is said to be using the platform and its data unlawfully. According to Reddit, Anthropic believes that it can take and use any content with impunity: “This is not the case.”

(mho)


This article was originally published in German. It was translated with technical assistance and editorially reviewed before publication.