OpenAI under suspicion: GPT-4o allegedly trained with O'Reilly books

A study suggests that OpenAI trained its AI model GPT-4o on books from the US tech publisher O'Reilly. The authors call for licensing agreements for AI training.



The US software company OpenAI is said to have used at least 34 books from the publisher O'Reilly to train its AI model GPT-4o without permission. This is indicated by a study from the AI Disclosures Project, in which the publisher's founder and CEO, Timothy O'Reilly, was himself involved. The researchers also examined two other OpenAI models, GPT-3.5 Turbo and GPT-4o mini, but found less clear evidence that O'Reilly's copyrights had potentially been infringed.

In their investigation, the study authors asked OpenAI's AI models a series of multiple-choice questions. One of the four answer options was a verbatim quote from one of the 34 O'Reilly books examined, while the other three options were paraphrased versions of it. In total, they used almost 14,000 excerpts from the books. If a model recognized the verbatim quote, the researchers interpreted this as an indication that it had been trained on copyrighted material from the publisher.
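The quiz setup described above can be sketched roughly as follows. This is a minimal illustration and not the study's code: the names `build_question` and `recognition_rate` are hypothetical, and `choose` stands in for whatever call would ask a model to pick one of the four options. With four options, a model that has never seen the text should pick the verbatim quote about 25 percent of the time.

```python
import random
from typing import Callable, List, Sequence, Tuple


def build_question(verbatim: str, paraphrases: Sequence[str],
                   rng: random.Random) -> Tuple[List[str], int]:
    """Shuffle one verbatim quote with three paraphrases.

    Returns the shuffled options and the index of the verbatim quote.
    """
    if len(paraphrases) != 3:
        raise ValueError("expected exactly three paraphrases")
    options = [verbatim] + list(paraphrases)
    rng.shuffle(options)
    return options, options.index(verbatim)


def recognition_rate(items: Sequence[Tuple[str, Sequence[str]]],
                     choose: Callable[[List[str]], int],
                     seed: int = 0) -> float:
    """Fraction of questions on which `choose` picks the verbatim quote.

    `choose` is a hypothetical stand-in for querying a model: it receives
    the four options and returns the index of the one it selects.
    """
    rng = random.Random(seed)
    hits = 0
    for verbatim, paraphrases in items:
        options, answer = build_question(verbatim, paraphrases, rng)
        if choose(options) == answer:
            hits += 1
    return hits / len(items)
```

A recognition rate well above 25 percent on a book's excerpts is the kind of signal the researchers treated as an indication of prior exposure.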

Specifically, the study authors calculated a so-called AUROC value (area under the receiver operating characteristic curve), a statistical measure of how well two groups can be told apart: 50 percent corresponds to chance, 100 percent to perfect separation. Higher values indicate a higher probability that OpenAI trained an AI model on the books published by O'Reilly. For GPT-4o, the researchers determined a value of 82 percent, which they took as a clear indication that the content of the books was used to train the model. They also suspect that OpenAI used data from the shadow library Library Genesis, which contains all 34 books.
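To make the AUROC figure more concrete, here is a minimal sketch of how such a value can be computed from two groups of scores. It assumes, purely for illustration, that each excerpt or book has a numeric score (for example a recognition rate) and that the two groups are material a model could have seen and material it could not; the function name `auroc` and the sample numbers are hypothetical and not taken from the study.

```python
from typing import Sequence


def auroc(positives: Sequence[float], negatives: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive score is
    higher than a randomly chosen negative score; ties count as half.
    """
    if not positives or not negatives:
        raise ValueError("both groups must be non-empty")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical scores: the first group is mostly, but not always, higher.
possibly_seen = [0.9, 0.8, 0.7]
unseen = [0.6, 0.75, 0.2]
print(f"AUROC: {auroc(possibly_seen, unseen):.2f}")  # prints "AUROC: 0.89"
```

In this toy example eight of the nine pairs are ordered correctly, giving 8/9. A value near 0.5 would mean the two groups are indistinguishable, which is how the low values for the other two models are read.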

Furthermore, the researchers of the AI Disclosures Project concluded that the importance of non-public data in the training of OpenAI models has increased over time. The older GPT-3.5 Turbo model, with training data from 2021, achieved an AUROC value of only 54 percent for non-public excerpts, far below the 82 percent measured for GPT-4o. The smaller GPT-4o mini model, released in 2024, achieved a similarly low 56 percent. Because both values are close to the 50 percent expected by chance, the study authors conclude that OpenAI probably did not train these two models on the O'Reilly books.


Although the study examines only OpenAI models and works from O'Reilly, the authors see a systemic problem in the use of copyrighted works to train language models. They call for more transparency and a formal licensing framework for content used in training. Without appropriate remuneration, they warn, there will eventually be no content left on which to train future models. Most recently, the New York Times also filed a lawsuit against OpenAI over copyright infringement in the training of AI models.

(sfe)


This article was originally published in German. It was translated with technical assistance and editorially reviewed before publication.