Model collapse - how synthetic data can kill AI

Generative AI is only as good as its training data. According to a new study, AI itself will soon make the internet useless as a source of that data.


(Image: photoschmidt/ Shutterstock.com)

This article was originally published in German and has been automatically translated.

According to a study published in the scientific journal Nature, AI models are at risk of collapse. The cause is the training data, which AI itself renders useless: AI-generated training data becomes more and more uniform with each round until nothing meaningful is produced any more. The scientists focus on the internet as a source of synthetic data and discuss the deliberate poisoning of data.

Large language models, and similarly image generators, learn from the training data made available to them. Put very briefly, they derive probabilities from that data. An answer therefore consists of whatever is statistically most likely to fit the question: a sequence of probable words that together form the sentences.
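
To illustrate the principle, here is a heavily simplified sketch in Python (my own illustration, not code from the study or from any real model): it counts which word follows which in a tiny training text and then generates new text by sampling the statistically most likely continuations.

```python
from collections import Counter, defaultdict
import random

# Heavily simplified illustration of "learning probabilities from training data":
# a bigram model that counts which word follows which.
training_text = "the dog chased the cat and the dog barked at the cat".split()

# Learn bigram frequencies: how often does word B follow word A?
followers = defaultdict(Counter)
for current, nxt in zip(training_text, training_text[1:]):
    followers[current][nxt] += 1

def generate(start, length=8):
    word, output = start, [start]
    for _ in range(length):
        options = followers.get(word)
        if not options:
            break
        words, counts = zip(*options.items())
        word = random.choices(words, weights=counts)[0]  # sample by learned probability
        output.append(word)
    return " ".join(output)

print(generate("the"))
```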

A commentary on the study in Nature explains the resulting problem of model collapse using dogs. At the start, the training data contains many different breeds, but golden retrievers appear slightly more often than the others. In a first step, the AI therefore also shows a golden retriever somewhat more often when asked for a dog. As the AI is developed further, it is trained partly on the data from that first step, that is, on material in which golden retrievers are already over-represented. At some point, the AI will only ever show golden retrievers when asked for a dog. The authors of the study assume that an actual collapse of the models follows from this.
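
The feedback loop can be illustrated with a small, hypothetical simulation (my own sketch, not taken from the study): each model "generation" is trained only on a finite sample produced by the previous generation, so rare breeds that happen to be missing from one sample never come back.

```python
import random
from collections import Counter

# Hypothetical simulation of the dog example: every generation is trained on a
# finite sample drawn from the previous generation's output distribution.
breeds = ["golden_retriever"] + [f"breed_{i}" for i in range(9)]
weights = [0.2] + [0.8 / 9] * 9      # golden retrievers slightly over-represented
sample_size = 100                     # finite training set per generation
random.seed(0)

for generation in range(1, 301):
    sample = random.choices(breeds, weights=weights, k=sample_size)
    counts = Counter(sample)
    # The next generation's distribution is just the empirical frequencies;
    # a breed that falls out of the sample can never return.
    weights = [counts[b] / sample_size for b in breeds]
    if generation % 50 == 0:
        surviving = sum(1 for w in weights if w > 0)
        print(f"generation {generation}: {surviving} breeds left, "
              f"most common covers {max(weights):.0%}")
```

Run long enough, the simulated diversity shrinks until a single breed covers the entire training set, which is the intuition behind the predicted collapse.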

The fact that so-called synthetic data, i.e. data created by an AI, can become problematic has been described several times before. Such data is repetitive and threatens to overwrite previously learned knowledge, such as the many different dog breeds. Model collapse is also often compared to the mythical serpent Ouroboros, which endlessly devours its own tail.

In Nature, the authors write that the internet is being flooded with such synthetic data, often without any indication that the content is AI-generated. They compare the problem to attempts to flood social media and search engines with low-quality content, for example from bot farms, although that problem is much easier to deal with. "Large language models must be trained to produce results with a lower probability." Such low-probability results are crucial for understanding complex systems.
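
Why low-probability results matter can be sketched with a simple, hypothetical example (my assumption, not from the paper): if generation systematically favours high-probability outputs, for instance via a low sampling temperature, rare events almost vanish from the generated data and are therefore missing from the next round of training material.

```python
import random
from collections import Counter

# Illustrative sketch: sharpening a distribution (low "temperature") makes
# rare but important events nearly disappear from generated data.
events = ["common", "uncommon", "rare"]
true_probs = [0.85, 0.12, 0.03]

def sharpen(probs, temperature):
    # Raising probabilities to the power 1/temperature and renormalising
    # mimics a model that strongly favours its most likely outputs.
    powered = [p ** (1 / temperature) for p in probs]
    total = sum(powered)
    return [p / total for p in powered]

random.seed(0)
for temperature in (1.0, 0.5):
    sampled = random.choices(events, weights=sharpen(true_probs, temperature), k=10_000)
    counts = Counter(sampled)
    print(f"temperature {temperature}: rare events = {counts['rare'] / 10_000:.3%}")
```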

For AI providers, it will therefore very soon no longer be an option to use freely available data from the internet as they have done up to now. That is quite apart from the restrictions being introduced by website operators, who are increasingly blocking crawlers.

The available data is limited in any case; scientists also speak of an information maximum. At the same time, AI providers are trying to scale their models further and further, that is, to train them with ever more data. Some scientists even see this as a way to create Artificial General Intelligence (AGI). Most researchers, however, are critical of this project: they do not believe that scaling alone can produce capabilities such as logical reasoning.

Scaling, however, requires more data, and beyond the information maximum it can only be produced synthetically. One advantage would be that fewer clickworkers around the world would have to sift through, label and sort out material that should not flow into the models, a task that places a heavy psychological burden on the people who perform it.

(emw)