AI: Collapse looms due to AI-generated training data

Researchers have found that AI models sabotage themselves when AI-generated data is used for their training: over successive generations, they produce more and more garbage.

Cows in a pasture that look strange, some with zebra patterns, for example.

AI models know what cows look like. In a few generations, that may no longer be the case.

(Image: generated with Dreamstudio)

By Michael Link
This article was originally published in German and has been automatically translated.

AI models could choke on their own output. They could become completely dysfunctional if they are fed with data that is itself AI-generated. Researchers at Rice University in Houston, Texas, played out this scenario. For their study "Self-Consuming Generative Models Go MAD" [PDF], they chose image generation, because it makes the problem easy to visualize. The study focuses on generative image models such as the popular DALL-E 3, Midjourney and Stable Diffusion. The researchers showed that the generated images become progressively worse over several iterations of the underlying models when AI-generated images are themselves used to train new model generations.

Richard Baraniuk, Professor of Electrical and Computer Engineering at Rice University, explains: "The problems arise when this training is repeated over and over again with synthetic data, forming a kind of feedback loop. We call this an autophagic or self-consuming loop." His research group has been studying such feedback loops. Baraniuk: "The bad news is that after just a few generations of such training, the new models can be irreparably damaged. Some have called this model collapse, for example colleagues working on large language models (LLMs). However, we find the term 'Model Autophagy Disorder' (MAD) more fitting, in reference to mad cow disease."

Mad cow disease is a neurodegenerative disease that is fatal to cows and has a human equivalent caused by eating infected meat. The disease gained widespread attention in the 1980s when it was discovered that cows were being fed the processed remains of their slaughtered counterparts - hence the term 'autophagy', from the Greek 'auto', meaning 'self', and 'phagy', meaning 'to eat'.

The hunger for data to train new artificial intelligence (AI) models such as OpenAI's GPT-4 or Stability AI's Stable Diffusion is immense. It is foreseeable that AIs will gradually incorporate ever larger amounts of text, images and other content into their training that is itself not man-made. In other words, they will be fed with their own data, mirroring mad cow disease.

The researchers at Rice University investigated three scenarios of such self-consuming training loops to provide a realistic representation of how real and synthetic data are combined into training datasets for generative models.

In the fully synthetic loop, successive generations of a generative model were fed exclusively on a synthetic data diet drawn from the outputs of previous generations. In the synthetic augmentation loop, by contrast, the training dataset for each generation combined synthetic data from previous generations with a fixed set of real training data. In the third scenario, the fresh data loop, each new model was trained on a mixture of synthetic data from previous generations and a fresh set of real training data.
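How the training data is assembled in each of the three regimes can be illustrated with a short sketch. The following Python snippet is only an illustration of the data-mixing logic described above, not code from the study; the train and generate callables and all parameter names are hypothetical assumptions.

def simulate_loop(loop_type, real_data, train, generate, generations=5,
                  n_synth=1000, n_fresh=1000):
    # train(dataset) -> model and generate(model, n) -> list of samples are
    # assumed, hypothetical callables; only the data mixing below matters here.
    synthetic = []                       # outputs of earlier model generations
    unseen_real = list(real_data)        # real samples no generation has used yet
    fixed_real = unseen_real[:n_fresh]   # reused every round in the augmentation loop

    model = None
    for gen in range(generations):
        if loop_type == "fully_synthetic":
            # First generation starts from real data, afterwards synthetic only.
            dataset = synthetic if synthetic else fixed_real
        elif loop_type == "synthetic_augmentation":
            # The same fixed real dataset plus synthetic data, every generation.
            dataset = fixed_real + synthetic
        elif loop_type == "fresh_data":
            # Previously unseen real samples plus synthetic data.
            fresh, unseen_real = unseen_real[:n_fresh], unseen_real[n_fresh:]
            dataset = fresh + synthetic
        else:
            raise ValueError("unknown loop type: " + loop_type)

        model = train(dataset)
        synthetic = generate(model, n_synth)  # feeds the next generation

    return model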

Fifth-generation AI portraits become more and more similar after training with AI data

(Image: Study "Self-Consuming Generative Models Go MAD", source: https://arxiv.org/abs/2307.01850)

Progressive iterations of the loops showed that the models produced increasingly distorted images over time; the less fresh data they received for training, the more pronounced the effect. Comparing successive generations of image datasets reveals the progressive impoverishment: images of faces become increasingly riddled with grid-like scars - what the authors call "generative artifacts" - or look more and more like the same person. Datasets of numbers degrade into indecipherable scribbles.

The problem is also exacerbated by human behavior itself. Photos of plants are predominantly flowers, photographed people are more likely to be smiling than in everyday life, and vacation pictures in the mountains usually show sun and snow. An AI trained on such data could conclude that most plants are flowers and that people smile most of the time - neither of which is true - and that the sky in the mountains is always blue. After a few model generations, AI generators would no longer be able to depict stalks of wheat, crying children or a rain shower on a mountain hike.

Just as the gene pool shrinks with the extinction of animal and plant species, the range of things that AI generators can produce is also shrinking.

AI developers no longer face only the question of which data they are allowed to use. According to the study, the convenient route of training on AI-generated data amounts to the business model committing suicide in installments. In their own interest alone, AI developers should not use AI data to train future models if their generators are to keep functioning in the long term. In fact, companies would have to agree on standards, but that is not in sight. At the very least, labeling content generated by AI tools on the web is clearly essential, not only for consumers but also for the developers themselves.

Already, data for training is so scarce that AI-generated content has long been used for this purpose - with a growing risk of contamination and thus of "data mad cow disease". If AI content were always labeled, companies could exclude it from training to protect their new generator models. They would then, however, have to satisfy their hunger for data in other ways: by falling back on content produced exclusively by humans, and by more reliably marking as such any content they themselves have produced with AI assistance. Against this backdrop, the question of remuneration for the use of such data for training purposes arises anew: man-made content will evidently remain valuable.
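If such labels existed as machine-readable metadata, excluding AI-generated material from a training corpus would be a simple filtering step. The following Python snippet is only a hypothetical sketch; the "ai_generated" flag and the sample structure are assumptions, not part of any existing labeling standard.

def filter_human_made(samples):
    # Keep only samples that are not flagged as AI-generated.
    # Assumes each sample carries a hypothetical 'ai_generated' provenance flag.
    return [s for s in samples if not s.get("ai_generated", False)]

corpus = [
    {"text": "handwritten field report", "ai_generated": False},
    {"text": "chatbot-produced summary", "ai_generated": True},
]
training_corpus = filter_human_made(corpus)  # only the first sample remains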

(mil)