Study for rights holders: AI training is copyright infringement

An analysis commissioned by the Initiative Urheberrecht aims to shed light on AI training. Its conclusion: the exception for text and data mining does not apply.


The Initiative Urheberrecht (IU, Copyright Initiative) believes that a study it commissioned provides evidence that the reproduction of works by generative artificial intelligence (AI) models, such as OpenAI's ChatGPT or Google's Gemini, constitutes reproduction that is relevant under copyright law. This could have far-reaching consequences, for example for the continued usability of chatbots. A closer look at the technology shows that "the training of such models is not a case of text and data mining", explained Hanover law professor Tim W. Dornis, who carried out the analysis together with Magdeburg computer scientist Sebastian Stober. "This is a copyright infringement."

At the presentation of the study in the EU Parliament on Thursday, Dornis said that German and European copyright law contains no valid limitation on the exclusive right of exploitation that would permit use for commercial AI training. With their work, the two professors want to shed light on the black box of how large language models learn. According to the study, AI providers extract and exploit extensive syntactic, and therefore copyright-protected, information from the works in the training data.

According to the study, copyright-protected works are copied during data collection, represented in whole or in part in the AI models and can ultimately also be reproduced by end users. During training, there are "numerous different acts of reproduction of copyrighted works". This starts with their "collection, preparation and storage". During both pre-training and fine-tuning, relevant copies are then created "inside" the model. Although there is no explicit storage mechanism, the training data is "memorized" in the current generative models, i.e. it is kept in their memory, so to speak.
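What "memorized" training data can look like in practice may be illustrated with a simple check for verbatim overlap between a source text and a model's output. The following is only a minimal sketch and not the method used in the study: the function name `longest_shared_run` and both sample strings are hypothetical, and the comparison uses Python's standard `difflib` module.

```python
from difflib import SequenceMatcher


def longest_shared_run(source: str, output: str) -> str:
    """Return the longest passage that appears verbatim in both texts."""
    # autojunk=False disables difflib's heuristic that ignores very
    # frequent characters in longer texts.
    matcher = SequenceMatcher(None, source, output, autojunk=False)
    match = matcher.find_longest_match(0, len(source), 0, len(output))
    return source[match.a : match.a + match.size]


# Hypothetical example: a public-domain sentence and an invented model output.
source_text = "It was the best of times, it was the worst of times"
model_output = "The model wrote: it was the worst of times, apparently."

shared = longest_shared_run(source_text, model_output)
print(repr(shared), len(shared))
```

A long shared passage only indicates that text was reproduced verbatim; whether this amounts to a reproduction in the legal sense is the question the study addresses.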


Finally, the researchers point out that the use of generative AI models could lead to copies and adaptations of the works used for training, particularly when users enter prompts. This would violate the creators' right of publication.

The stumbling block: ChatGPT & Co. and image generators such as DALL-E, Stable Diffusion and Midjourney are based on large language and image models. The operators train these with millions of photos, audio files and texts that they find on the Internet. As a rule, they do not ask authors and users whether they agree to this use. The use of the largely protected works is necessary in the field of AI modeling so that the algorithms can recognize patterns in the existing material and create adaptive content based on them.

In the EU, legislators defined exceptions to the exclusive exploitation right for text and data mining in the most recent major copyright reform, the 2019 Directive on Copyright in the Digital Single Market. The German parliament implemented this requirement in sections 60d and 44b of the Copyright Act. Reproductions of legally accessible digital works are therefore permitted, for example for AI training, "to obtain information from them, in particular about patterns, trends and correlations". Research institutions are entitled to do so, provided they do not pursue commercial purposes, reinvest all profits in science or "operate in the public interest within the framework of a state-recognized mandate". This is intended to prevent large-scale data mining by research institutions in the service of companies.

Authors and other rights holders who nevertheless wish to prevent text and data mining of their works available online can reserve the right of use for themselves. Such a reservation is only effective if it is made "in machine-readable form", for example via the robots.txt file.
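How a crawler could evaluate such a reservation in robots.txt can be sketched with Python's standard `urllib.robotparser` module. This is a minimal illustration under stated assumptions: the crawler name `ExampleAIBot`, the helper `may_fetch` and the URL are hypothetical, and whether a given robots.txt entry is legally sufficient as a reservation is not something the code can decide.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks one (invented) AI crawler, allows all others.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check whether the given crawler may fetch the URL under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


url = "https://example.org/article.html"
print(may_fetch(ROBOTS_TXT, "ExampleAIBot", url))   # False: crawler is excluded
print(may_fetch(ROBOTS_TXT, "SomeSearchBot", url))  # True: falls under "*"
```

The file only expresses the reservation; it takes effect in practice only if the crawler actually reads and honors it.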

However, the study's authors emphasize that the current copyright exceptions cover the encroachments on copyright associated with training generative AI models only "in a few, practically irrelevant constellations". Even if the training takes place outside Europe, the developers cannot escape European regulations.

MEP Axel Voss (CDU) welcomed the evidence now available. He hopes that the study will provide "further important indications and suggestions for a better balance between the protection of human creativity and the promotion of AI innovations". The researchers suggest that legislators should decide how this balance can be achieved. For Hanna Möllers, legal advisor to the German Journalists' Association (DJV), the results are "explosive". They show "that we are dealing with a large-scale theft of intellectual property". Politicians must now put an end to this "robbery" at the expense of authors, she said.

The experts provided "the technological and copyright basis for finally turning the legal consideration of generative AI on its head", emphasized Matthias Hornschuh from the IU. A "new, profitable licensing market has long been on the horizon", which providers of generative AI have so far cleverly avoided. Various lawsuits have already been filed against OpenAI by authors and media companies such as the New York Times.

(mma)


This article was originally published in German. It was translated with technical assistance and editorially reviewed before publication.