Technical measures against the onslaught of AI crawlers

AI companies are sending out masses of new crawlers. The IETF is also concerned with how these can be regulated.


(Image: vchal/Shutterstock.com)

By Monika Ermert

The onslaught of crawlers on web pages is now even prompting the Internet Engineering Task Force (IETF) to change its infrastructure. Within a year, ChatGPT requests to the IETF Datatracker, the central point of contact for the standardization work, jumped by 4000 percent. At the same time, several IETF groups are working feverishly on standards to help the network cope with the onslaught of crawlers.

"The increase in crawling traffic has forced us, as a relatively small provider, to react," says Robert Sparks, Senior Director of Information Technology at IETF LLC, the operational arm of the IETF, at a meeting in Madrid. Until a year ago, the standardization organization distributed its content, including the data tracker, which is the central platform for standardization –, from a single server. Now they have upgraded with a CDN.

Sparks calls the development "dramatic". Of the requests the IETF receives each month, 3.23 billion are immediately discarded: traffic from two incorrigible bots. Of the remaining traffic, around 10 percent still comes from bots, with the AI crawlers leading the pack. ChatGPT accesses the Datatracker the most, followed by GoogleBot, BrightBot and AliyunSecBot.

The figures were confirmed by several studies in a session of the IETF's Measurement and Analysis for Protocols Research Group dedicated specifically to bot traffic. Cloudflare noted an increase in GPTBot traffic of more than 300 percent. The Wikimedia Foundation has seen a 50 percent increase in bandwidth demand from bots since January 2024.

At the same time, the crawlers are bringing in fewer and fewer readers because they serve up the content themselves. Before more and more websites severely restrict or completely block access, as recently announced with Cloudflare's "Content Independence Day", technical standards are meant to restore a better balance.

The AIPref working group, whose work should allow content providers to declare their preferences regarding AI crawlers through an update to robots.txt, is already on the home straight. robots.txt was originally created so that sites could indicate with simple labels whether or not they tolerate crawling. AIPref adds AI crawlers of all kinds as a category to robots.txt. The preferences can also be expressed in a field in the HTTP header.
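
The exact directive names and category vocabulary are still being settled by the working group. Purely as an illustration of the two channels described above, a robots.txt extension plus an HTTP response header, the following Python sketch uses "Content-Usage" as a placeholder field name and hypothetical category values ("train-ai", "search"); the final syntax is defined by AIPref, not here.

# Illustrative sketch only: the "Content-Usage" field name and the category
# values "train-ai" and "search" are placeholders modeled on the two channels
# the AIPref work describes (a robots.txt extension plus an HTTP header);
# the working group defines the actual vocabulary.
ROBOTS_TXT = """\
User-Agent: *
Allow: /

# Hypothetical AI-preference line: forbid use of the content for AI training,
# but allow real-time retrieval for answering user queries.
Content-Usage: train-ai=n, search=y
"""

# The same preference expressed per response as an HTTP header field.
AI_PREF_HEADERS = {
    "Content-Usage": "train-ai=n, search=y",
}

def write_robots_txt(path: str = "robots.txt") -> None:
    # Publish the site-wide preference file at the web root.
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(ROBOTS_TXT)

if __name__ == "__main__":
    write_robots_txt()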

The new anti-crawler standard, which was originally requested by major media companies at a workshop in Washington, should be ready by the end of August.

However, there was still some discussion at the meeting in Madrid as to whether the proposed differentiation between AI crawler types is clear enough. Users should be able to decide whether to make distinctions, such as prohibiting the use of content for AI training while allowing the real-time search crawlers of the AI models. The aim was to start with a simple solution, explained Martin Thomson of Mozilla, one of the authors. At the same time, it was already clear that demarcation is difficult: the boundaries between crawlers with different intentions are blurred.

To ensure that the new robots.txt extension is actually adopted, the developers are already looking to EU legislators. The Robots Exclusion Protocol, standardized in RFC 9309, is already referenced in the Code of Practice, a companion piece to the European AI Act that gives recommendations on how to comply with it. Anyone who signs the Code of Practice guarantees adherence to the standards set out in it; those who do not sign must find other ways to demonstrate compliance.

A group that met in Madrid for the first time now wants to tackle the problem from a different angle: in addition to content providers declaring their preferences, bots of all kinds should identify themselves cryptographically.

This would allow content providers to better control crawler traffic, said Chris Needham, standardization expert at the BBC; in a next step, it would also become possible to negotiate licenses with crawler operators. Identification via user agents can be spoofed, while identification via IP addresses is complex and imprecise.


Representatives of Google and OpenAI assured in Madrid that they want to be among the "good bots" and support corresponding standardization. Eugenio Panero from OpenAI said: "Because there is no standard, identification is hard and trashy", even with respect to partners with whom they have agreements. IP addresses change, which requires constant updates, and non-standardized headers can be spoofed.

The HTTP message signing used to date is inadequate. OpenAI hopes that a WebBotAuth standard will make it easier for sites that want to allow ChatGPT Agent requests to do so. A distinction is drawn between the ChatGPT Agent, which makes requests on behalf of users, and the ChatGPT Bot.
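
How cryptographic identification differs from a spoofable user-agent string can be sketched in a few lines. The following Python example (using the cryptography library and an Ed25519 key pair) is a deliberately simplified illustration, not the wire format of RFC 9421 or the WebBotAuth drafts: the signature base is reduced to a handful of request components, and key discovery, in practice a published key directory, is assumed to happen out of band.

# Simplified sketch of signature-based bot identification in the spirit of
# HTTP message signatures (RFC 9421), on which the WebBotAuth proposal builds.
# The signature base below is NOT the real wire format; it only shows that a
# signed request, unlike a user-agent string, cannot be forged without the key.
import base64

from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

def signature_base(method: str, authority: str, path: str, agent: str) -> bytes:
    # Stand-in for the "covered components" of a real message signature.
    return f"@method={method}\n@authority={authority}\n@path={path}\nsignature-agent={agent}".encode()

# Crawler side: sign the request components with the bot operator's private key.
bot_key = Ed25519PrivateKey.generate()
base = signature_base("GET", "datatracker.ietf.org", "/doc/rfc9309/", "crawler.example")
signature = base64.b64encode(bot_key.sign(base)).decode()

# Origin side: recompute the base and verify against the bot's published public key
# (here taken directly from the key pair; in practice fetched from a key directory).
public_key = bot_key.public_key()
try:
    public_key.verify(base64.b64decode(signature), base)
    print("verified: request was signed by the key holder")
except InvalidSignature:
    print("rejected: signature does not match, request may be spoofed")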

The developers could not dispel concerns that better bot authentication could create hurdles for new bots and lead to concentration effects among both bots and content providers. A good solution would definitely have to take centralization into account and avoid it, acknowledged Mark Nottingham from Cloudflare.

WebBotAuth should be launched as a new working group as soon as possible.

Bing Product Manager Krishna Madhavan also had his say. At the MAPRG meeting, he presented the IndexNow protocol used by Bing, which replaces crawler requests for updates with push notifications from content providers whenever they want to propagate new content or new versions of their pages. The update signal allows a better balance between effectiveness and "freshness" of information.
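
According to the public IndexNow documentation at indexnow.org, such a push is a plain HTTP request: the site submits a list of changed URLs together with a key that it also hosts as a text file so the receiving search engine can verify ownership. The following Python sketch uses placeholder host, key and URLs; it follows the published API, not anything specific shown at the meeting.

# Sketch of an IndexNow submission as documented at indexnow.org: the content
# provider pushes changed URLs instead of waiting to be crawled. Host, key and
# URLs are placeholders; the key must also be published as a text file on the
# submitting site so the receiver can verify ownership.
import json
import urllib.request

payload = {
    "host": "www.example.com",
    "key": "replace-with-your-indexnow-key",
    "keyLocation": "https://www.example.com/indexnow-key.txt",
    "urlList": [
        "https://www.example.com/articles/new-story",
        "https://www.example.com/articles/updated-story",
    ],
}

request = urllib.request.Request(
    "https://api.indexnow.org/indexnow",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json; charset=utf-8"},
    method="POST",
)
with urllib.request.urlopen(request) as response:
    # A 200 or 202 status signals that the submission was accepted for processing.
    print(response.status)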

When asked by heise online, Madhavan said that the company was interested in bringing the technology to the IETF in order to create a standard. It remains to be seen whether Microsoft is prepared to hand over the previously proprietary API to a standardization process.

(olb)


This article was originally published in German. It was translated with technical assistance and editorially reviewed before publication.