Nepenthes: a tarpit for AI web crawlers
Web crawlers for AI models often do not even stop at copyright-protected content – the Nepenthes tool sets a trap for them.
Web crawlers play a central role in the race for the best AI model: they automatically scour the web for content that developers can use to feed their large language models. Nepenthes is a tool that lures these crawlers into an endless labyrinth – or even satisfies their hunger for data with masses of pointless content.
The big problem with the AI developers' web crawlers is that they do not even stop at copyright-protected content. Website operators can, in principle, state in robots.txt that they do not allow crawling for LLM purposes. However, the required directives vary from AI crawler to AI crawler, and some companies are already trying to circumvent such blocks.
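For illustration: blocking works per crawler user agent, so operators have to list each vendor's published token separately. A robots.txt along the following lines would address a few well-known AI crawlers (the tokens shown are examples from the vendors' public documentation, not a complete list):

    User-agent: GPTBot
    Disallow: /

    User-agent: Google-Extended
    Disallow: /

    User-agent: CCBot
    Disallow: /

A crawler that simply ignores robots.txt, however, is not stopped by any of this – which is exactly the gap Nepenthes aims at.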
A tarpit for web crawlers
Programmer Aaron B. was particularly annoyed by this state of web crawling for LLM purposes, which is why he developed the Nepenthes tool. It shares its name with a genus of carnivorous pitcher plants – the difference being, according to B., that the Nepenthes program catches web crawlers rather than insects.
"It's a tarpit designed to catch web crawlers," writes B. on his website. The whole thing is intended for AI web crawlers in particular. "it'll eat just about anything that finds it's way inside", B. makes clear. This refers to web crawlers of a different kind, for example from search engines. Anyone who includes Nepenthes on their own site will most likely be kicked out of Google searches, warns B.
Nonsense fodder for AI crawlers
Nepenthes works by generating pages, each with around a dozen links that all lead back into Nepenthes itself. In addition, the Nepenthes pages have extremely long loading times, which ties up the crawlers. The concept can be tried out here (yes, the page loading at a snail's pace is intentional). Those with enough computing power and bandwidth can go one step further and feed the crawlers Markov-generated nonsense that clogs up the hard disks of the AI companies' servers.
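The basic principle can be outlined in a few lines of Python. The following is a minimal sketch of the tarpit idea as just described – self-referencing links, drip-fed responses, word-salad filler – and not B.'s actual implementation:

    # Minimal tarpit sketch: every page is babble plus a dozen links
    # that lead straight back into the maze, streamed at a crawl.
    import random
    import time
    from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

    # Tiny word pool; a real setup would drive a Markov chain
    # trained on a larger corpus instead.
    WORDS = ("data model crawl token web index page link corpus "
             "noise signal archive mirror depth loop trap").split()

    def babble(n=60):
        return " ".join(random.choice(WORDS) for _ in range(n))

    class TarpitHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.end_headers()
            # Around a dozen links, each a fresh path deeper into the maze.
            links = " ".join(
                '<a href="/%016x/">more</a>' % random.getrandbits(64)
                for _ in range(12)
            )
            body = "<html><body><p>%s</p>%s</body></html>" % (babble(), links)
            # Drip-feed the response to tie up the crawler's time.
            for i in range(0, len(body), 32):
                try:
                    self.wfile.write(body[i:i + 32].encode())
                    self.wfile.flush()
                    time.sleep(0.5)  # the deliberate snail's pace
                except BrokenPipeError:
                    break  # crawler gave up early

        def log_message(self, *args):
            pass  # keep the demo quiet

    if __name__ == "__main__":
        # ThreadingHTTPServer lets several crawlers get stuck at once.
        ThreadingHTTPServer(("127.0.0.1", 8080), TarpitHandler).serve_forever()

Each response trickles out over tens of seconds while costing the server almost nothing per connection – precisely the asymmetry a tarpit relies on.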
Of course, there is also a catch: while the web crawlers – whether from an AI company or not – work their way through Nepenthes, the server behind the website is under constant load. The weaker the server, or the more crawlers caught at the same time, the greater that load becomes. The IPs of trapped web crawlers can be blocked, but given the number of crawlers roaming the net, Nepenthes is unlikely to run out of food. And anyone who really wants to stuff the crawlers with Markov-generated content – and has the resources for it – will not block any IPs in the first place. Aaron B. warns urgently: "If you don't know exactly what you are doing, you should keep your hands off the tool."
Whether Nepenthes really works as claimed has also been questioned. Modern web crawlers have a fixed upper limit on the number of pages they fetch from a single website, often based on the site's popularity, writes one user in a thread on Hacker News. The endless labyrinth that Nepenthes wants to be would then no longer work as such, though the tool could still help protect one's content from crawling. In an interview with 404 Media, B. responded to the argument from the Hacker News thread: "If this is true, then according to my access data, not even the almighty Google crawler is protected in this way."
(nen)