Autocomplete: Large language models can repeat training data verbatim

Researchers show that LLMs can reproduce copyrighted training data almost verbatim. That spells trouble for model providers.




If you want to read “Harry Potter and the Philosopher's Stone” but have misplaced your book, large parts of it could be extracted verbatim from large language models (LLMs) like Claude 3.7 Sonnet, Gemini 2.5 Pro, or Grok 3 with the right prompts. This is according to a preprint on arXiv, published by researchers from Stanford University.

The goal of the study was to find out whether the well-secured production language models from major providers can reproduce copyrighted works from their training data word for word. According to the LLM providers, the models do not memorize training data verbatim but at most learn a representation of its content, which is why model training is transformative and the use of protected works falls under fair use. The current state of research casts doubt on this assumption.

Since large sections of copyrighted works can already be extracted from open-weight models, the researchers wanted to test whether the same applies to proprietary models with stronger safeguards: Claude 3.7 Sonnet, GPT-4.1, Gemini 2.5 Pro, and Grok 3 – all models that are or have been in production. The researchers proceeded in two phases. First, they asked for a verbatim continuation of a text passage, for example the beginning of Chapter 1 of the first Harry Potter novel. If the model refused, they varied the wording of the prompt with random changes until they received a result or the model still refused after 10,000 variations. This technique is called Best-of-N (BoN) and is considered a jailbreak, meaning it bypasses the language models' safety measures.
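The retry loop behind a BoN-style jailbreak can be pictured roughly as follows. This is a minimal sketch, not the researchers' code: query_model stands in for a hypothetical call to a production LLM, and the concrete perturbations and refusal check are assumptions based on how BoN attacks are commonly described, not details taken from the preprint.

```python
import random
from typing import Callable, Optional

def perturb(prompt: str) -> str:
    """Apply small random changes to the prompt wording (random casing,
    occasional neighbouring-character swaps) - illustrative only."""
    chars = list(prompt)
    for i in range(len(chars)):
        if random.random() < 0.05:
            chars[i] = chars[i].swapcase()
    if len(chars) > 2 and random.random() < 0.5:
        j = random.randrange(len(chars) - 1)
        chars[j], chars[j + 1] = chars[j + 1], chars[j]
    return "".join(chars)

def is_refusal(answer: str) -> bool:
    """Crude refusal heuristic; a real experiment would use a stricter check."""
    return any(p in answer.lower() for p in ("i can't", "i cannot", "copyright"))

def best_of_n(base_prompt: str,
              query_model: Callable[[str], str],
              n: int = 10_000) -> Optional[str]:
    """Retry perturbed prompts until the model complies or the budget runs out."""
    for _ in range(n):
        answer = query_model(perturb(base_prompt))
        if not is_refusal(answer):
            return answer
    return None  # the model refused all N variations
```

The budget of 10,000 attempts mirrors the cutoff described in the article; in practice, the success rate of such attacks grows with the number of prompt variations tried.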

In the second step, the researchers repeatedly asked the model to continue the text based on the previously generated section. They compared the output against a reference copy of the book using the near-verbatim recall (nv-recall) metric, which is based on the longest identical text segment. For the first Harry Potter book, this yielded a text similarity of 95.8 percent for Claude 3.7 Sonnet, and 76.8 and 70.3 percent for Gemini 2.5 Pro and Grok 3, respectively. GPT-4.1 refused to cooperate, with an nv-recall score of four percent for Harry Potter.
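The two-step procedure can be sketched as follows. Again this is only an illustration under assumptions: extract_book, nv_recall, and query_model are hypothetical names, and the nv-recall implementation below is one plausible reading of "longest identical text segment relative to the reference" - the preprint's exact definition may differ in detail.

```python
from difflib import SequenceMatcher
from typing import Callable

def extract_book(seed_prompt: str,
                 query_model: Callable[[str], str],
                 max_rounds: int = 200) -> str:
    """Step 2: repeatedly ask the model to continue the text it produced so far."""
    text = query_model(seed_prompt)
    for _ in range(max_rounds):
        continuation = query_model(
            "Continue this passage verbatim:\n" + text[-2000:]
        )
        if not continuation.strip():
            break
        text += continuation
    return text

def nv_recall(generated: str, reference: str) -> float:
    """Near-verbatim recall: length of the longest segment shared verbatim
    with the reference book, relative to the reference length."""
    matcher = SequenceMatcher(None, generated, reference, autojunk=False)
    match = matcher.find_longest_match(0, len(generated), 0, len(reference))
    return match.size / len(reference)
```

On this reading, an nv-recall of 95.8 percent means that almost the entire reference text reappears as one contiguous, identical passage in the model output.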


The Stanford researchers report that they had to use the BoN jailbreak for Claude 3.7 Sonnet and GPT-4.1 to get the models to produce a result at all. Claude then reproduced four books almost completely verbatim, including “Harry Potter and the Philosopher's Stone” and “1984”. Gemini 2.5 Pro and Grok 3 followed the instruction without further prompt engineering. The paper concludes that large language models, contrary to the claims of model providers, memorize parts of their training data, and that the existing safeguards at the model and system level are therefore not sufficient to protect that training data from extraction.

The arXiv preprint follows up on a similar Stanford study from May 2025, which investigated the reproduction of entire books by open-weight models such as Llama 3.1. A November 2024 study by researchers at ETH Zurich shows that up to 15 percent of the outputs of LLMs from OpenAI, Anthropic, Google, and Meta correspond to existing text segments on the internet; in some cases, the models repeat answers from their training data verbatim. This raises security concerns for companies whose models are operated by third parties, and training on synthetic data could then become a source of further hallucinations.

For providers of large language models, verbatim quotation of unlicensed copyrighted works becomes a real problem when the rights holders sue over it. In the United States, The New York Times (NYT) has been locked in a years-long legal dispute with OpenAI after the publisher managed to extract entire articles from ChatGPT using a method similar to the one in the Stanford preprint. In a statement, OpenAI argued that the NYT had used misleading prompts, that no user would use the models in such a way, and that verbatim reproduction is a rare bug. The current Stanford preprint contradicts at least that last claim.

OpenAI has already lost a court case against GEMA. The German collecting society had sued because ChatGPT reproduced song lyrics such as “Atemlos” and “Männer” almost exactly on request, violating the authors' rights. While OpenAI argued that the output merely reflected training parameters, the court ruled that the model must have memorized the lyrics and prohibited the storage of the copyrighted texts going forward. Verbatim reproduction of training data was also at issue in a US class-action lawsuit brought by developers against Microsoft, GitHub, and OpenAI, which claimed that GitHub Copilot output code from existing repositories verbatim and without attribution. In that case, the court ruled in favor of the model providers.

(pst)


This article was originally published in German. It was translated with technical assistance and editorially reviewed before publication.