Copyright and AI training: "An almost paradoxical situation indeed"
Is it legal for AI companies to use content from creatives for training without payment? Professors Tim Dornis and Sebastian Stober have investigated this.
Library: AI systems need a lot of data. (Image: JĂĽrgen Kuri)
Training generative AI models is not text and data mining. This is the conclusion of the new open-access study "Copyright and training of generative AI models – Technological and legal foundations", which was commissioned by the Initiative Urheberrecht. The conclusion matters because AI companies invoke the text and data mining (TDM) exception, among other things, to avoid paying creatives anything.
"We began our investigation with an absolutely open mind and were also aware that the topic had not yet been extensively examined from an interdisciplinary perspective," say study authors Tim W. Dornis and Sebastian Stober. Dornis is a legal scholar and professor at the Faculty of Law at the University of Hanover and completed his J.S.M. at Stanford, while Stober is Professor of Artificial Intelligence at Otto von Guericke University Magdeburg.
In addition to finding the so-called TDM exception inapplicable, the researchers also conclude that training data – and thus copyrighted works – are reproduced within the AI models. "This is important for legal action against infringements in AI training and AI use," they say. In an interview with heise online, Dornis and Stober explain what this could mean in concrete terms.
heise online: Mr. Stober, Mr. Dornis, in discussions of whether AI training in its current form violates copyright law, there is constant talk of so-called text and data mining, under which it supposedly falls. What is TDM, and what are its intended uses?
Sebastian Stober: Data mining is the automated extraction of new information, patterns and findings from large collections of data; text mining does the same for text collections. The possible uses are very diverse. The insights gained can form the basis for business models – for example, when markets and customer behavior are analyzed. In politics, analyses of public opinion are commonplace, with text mining running in the background on social media data. Data mining is also an important tool in science for practically all data-driven questions.
However, as with any technology, the intended use does not always have to be positive for society. For example, the information obtained can also be used to manipulate people in a targeted manner, as was shown in the Cambridge Analytica scandal. Here, society is called upon to set clear boundaries as to which uses are undesirable.
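To make the distinction concrete, here is a minimal, purely illustrative sketch of classic text mining in Python: aggregate patterns (term frequencies, co-occurring terms) are extracted from a corpus, while the texts themselves are not retained. The mini-corpus and function names are invented for illustration and do not come from the study.

```python
from collections import Counter
from itertools import combinations

# Illustrative mini-corpus; real text mining operates on large collections.
corpus = [
    "markets react to customer sentiment",
    "customer behavior drives markets",
    "public opinion shifts customer sentiment",
]

def mine(texts):
    """Extract aggregate patterns: term frequencies and co-occurring term pairs.

    The result is statistical insight *about* the texts; the texts
    themselves are not stored or reproduced.
    """
    term_freq = Counter()
    pair_freq = Counter()
    for text in texts:
        tokens = text.lower().split()
        term_freq.update(tokens)
        pair_freq.update(combinations(sorted(set(tokens)), 2))
    return term_freq, pair_freq

terms, pairs = mine(corpus)
print(terms.most_common(3))  # e.g. [('customer', 3), ('markets', 2), ('sentiment', 2)]
print(pairs.most_common(1))  # e.g. [(('customer', 'markets'), 2)]
```

The output is a handful of statistics about the corpus – the "new insights" Stober describes – not a means of regenerating the source texts.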
Many AI companies now claim that their AI training falls under the so-called TDM exception, which they apparently derive from the AI Act. In other words, they believe it is covered by law. Yet the foundations of the AI Act were laid years before generative AI systems even existed.
Tim W. Dornis: I don't see it that way either. It cannot be inferred from the AI Act and its legislative materials, correctly interpreted, that generative AI training falls under the TDM exception. The wording alone raises doubts. Above all, however, an in-depth examination of the underlying principles and background would have been necessary – with a particular focus on copyright law.
That examination was still lacking when the AI Act was finalized. In other words: as our study shows, a thorough engagement with AI technology is required, and this has simply been neglected in the legal opinion-forming process so far. The debate should not stop here merely for the sake of "convenience", or on the argument that anything else could jeopardize "European AI innovation". Yet that seems to be the trend in the current legal debate.
Why is AI training so much more than TDM? It's just computers scouring the internet and drawing conclusions.
Stober: First, we have to distinguish between data collection and training, which are often carried out by different actors. Once a data collection has been created, it can be used by a wide variety of actors to train a wide variety of AI models.
Second, the concept of AI training needs to be differentiated more clearly. In our study, we took great care to stress that we are talking about the training of generative AI models. There are regulations for text and data mining that permit collecting data for the training of AI models. However, we conclude in the report that the training of generative AI models does not itself fall within the scope of text and data mining – because, among other things, no new insights are gained in the process. The trained models can only generate further data that resembles the training data. That is a completely different purpose, so the exception does not apply here – and that is the problem.
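By contrast with the mining sketch above, a generative model encodes its training data in order to produce similar data. A deliberately over-simplified, hypothetical sketch – a bigram "language model" trained on a single sentence – shows how memorized training content can resurface verbatim. Nothing in this snippet comes from the study itself; real models are vastly larger and the effect correspondingly subtler.

```python
import random
from collections import defaultdict

# Tiny illustrative "training set"; real models train on billions of tokens.
training_text = "the quick brown fox jumps over the lazy dog".split()

# "Training": record which word follows each two-word context.
model = defaultdict(list)
for i in range(len(training_text) - 2):
    context = (training_text[i], training_text[i + 1])
    model[context].append(training_text[i + 2])

def generate(seed, length=7):
    """Sample a continuation from the learned transition table."""
    out = list(seed)
    for _ in range(length):
        followers = model.get((out[-2], out[-1]))
        if not followers:
            break
        out.append(random.choice(followers))
    return " ".join(out)

# With this little data, "generation" reproduces the training text exactly:
print(generate(("the", "quick")))
# -> "the quick brown fox jumps over the lazy dog"
```

In this toy case, generating is identical to reproducing the training text – a drastically simplified analogue of the study's point that training data can be reproduced within the model.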
The law allows published material to carry a reservation against TDM. Wouldn't that be a simple solution?
Dornis: On paper, it looks like a simple solution. In practice, however, it is anything but effective. One need not even ask how to deal with works that have already been published (books, for example) – should copies everywhere in the world retroactively be fitted with opt-out notices?
For digital publications, too, we have to assume that once content is "online", it can hardly be retrofitted with a complete reservation – and, above all, one that crawlers & co. can actually parse. Finally, there remains the question (as always) of whether AI developers (and their crawlers etc.) adhere to such reservations at all.
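One common machine-readable form of such a reservation is a robots.txt rule, which crawlers consult on a purely voluntary basis. Here is a minimal sketch, assuming a publisher declares an opt-out for a known AI crawler user agent (GPTBot is OpenAI's) and that the crawler actually checks it; the domain is a placeholder, and robots.txt is only one of several conceivable opt-out mechanisms.

```python
from urllib import robotparser

# A publisher's robots.txt might declare a reservation for an AI crawler,
# e.g. (illustrative, not necessarily a legally sufficient opt-out):
#
#   User-agent: GPTBot
#   Disallow: /
#
# A well-behaved crawler checks this before fetching any page:
rp = robotparser.RobotFileParser()
rp.set_url("https://example.org/robots.txt")
rp.read()

if rp.can_fetch("GPTBot", "https://example.org/some-article"):
    print("crawling allowed")
else:
    print("reservation declared - skip this site")
```

Whether such signals meet the legal bar for an effective, machine-readable reservation – and whether crawlers honor them in practice – is exactly the open question Dornis describes.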
Can you explain how the AI industry came to assume that training in its current form is a kind of fair use – a doctrine that does not exist in Europe? Act first, apologize later, as Mark Zuckerberg once recommended?
Dornis: From a legal perspective, it's easy to explain: the mindset in Silicon Valley has always been "don't ask for permission, ask for forgiveness later". It is less about a belief in the legality of one's own actions and more about the conviction that, in the interest of innovation as a "good thing", short-term disruption – including the accompanying legal violations – has to be acceptable.
Moreover, Silicon Valley could evidently also rely on the legal analysis just described. In Germany at least, the hypothesis "training generative AI models = TDM" was put forward shortly after the capabilities and workings of generative AI became known, and more and more publications gradually followed suit.