Crawler without limits: Perplexity ignores robots.txt

Perplexity is facing a lot of criticism. The AI search engine does not adhere to crawler rules and reproduces content without permission, and sometimes incorrectly.

Typing hands, with symbols and the letters AI floating above them. (Image: Shutterstock/Poca Wander Stock)

This article was originally published in German and has been automatically translated.

The robots.txt file is meant to stop crawlers from scanning the content of a website. Perplexity, however, apparently does not adhere to it. Wired observed the responsible bot and ran its own tests, in which the AI search engine did not perform particularly well.

Perplexity is an "answer engine", as CEO Aravind Srinivas explains in an interview with heise online. Instead of a list of links, Perplexity returns an answer in continuous text, peppered with links to sources and key points. As with a chatbot, you can ask follow-up questions and go deeper into a topic. For this, Perplexity uses real-time information as well as snapshots; its crawlers index the web daily. One such bot has now caught the attention of the news magazine Wired: it ignores the robots.txt web standard, a file that tells crawlers which parts of a site they should stay away from.
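To illustrate what adhering to robots.txt means in practice, here is a minimal sketch of a well-behaved crawler in Python, using the standard library's robotparser module. The user-agent name "PerplexityBot" and the example URLs are assumptions for illustration, not details confirmed by the report.

    from urllib import robotparser

    USER_AGENT = "PerplexityBot"  # assumed crawler name, for illustration only

    # Fetch and parse the site's robots.txt once.
    rp = robotparser.RobotFileParser()
    rp.set_url("https://www.example.com/robots.txt")
    rp.read()

    url = "https://www.example.com/some-article"
    if rp.can_fetch(USER_AGENT, url):
        print("Allowed: the crawler may fetch", url)
    else:
        print("Disallowed: a compliant crawler skips", url)

Compliance is entirely voluntary: robots.txt is a request, not a technical barrier, which is why a crawler can simply ignore it.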

As a result, Perplexity can quote and reproduce articles that the answer engine should not have access to. This affects articles from Wired and, according to the magazine, other Condé Nast publications as well. Condé Nast is the publishing house behind the tech magazine and also owns numerous fashion and lifestyle titles such as Vogue and Glamour.

The reproductions are also not necessarily correct. Wired says it entered several article headlines into the search to get summaries. "The results showed that the chatbot sometimes paraphrases the Wired articles very accurately, but sometimes also summarizes them inaccurately and with minimal citation," Wired writes. In one case, Perplexity is even said to have claimed that Wired had reported on a police officer committing a crime, which is simply wrong.

To be more transparent, Perplexity is said to have recently published a list of the IP addresses its crawler uses, a list that has since been withdrawn. According to Wired, however, the Perplexity bot must be using at least one unpublished IP address, with which it has flouted the scraping rules and which showed up in the publisher's logs. Wired writes that there were at least 822 hits in the past three months, and that this is a "massive underreporting", as the publisher only keeps a fraction of its network logs.

Perplexity's CEO has not commented specifically on the allegations, saying only that Wired has misunderstood how the web and Perplexity work.

Other publishers have also complained about Perplexity. Forbes, CNBC and Bloomberg, for example, complain that so-called Perplexity Pages, AI-generated overviews reminiscent of Wikipedia pages, are based on their exclusive reporting, some of which sits behind paywalls. Forbes, for instance, had reported on the secret work of former Google CEO Eric Schmidt, who is said to be involved in a combat drone project. The corresponding overview page at Perplexity references the source only very inconspicuously. Srinivas then explained on X that he agreed the sources should be linked more prominently. That response is unlikely to satisfy Forbes or other publishers. According to Axios, Forbes has since initiated legal steps against Perplexity.

In an interview with heise online, Srinivas said that further announcements would be made shortly on how publishers will share in Perplexity's revenue in the future. The CEO envisions a new analytics system that no longer focuses on clicks, but instead counts how often information is read or used.

According to a Semafor report, Perplexity is also in talks with publishers about entering into partnerships. OpenAI, for its part, already pays individual publishers to use their content for training its AI models and to display it preferentially in its own products.

(emw)