AI training with synthetic data: "The internet is reaching its peak"

Large language models keep getting bigger and bigger, and they require more and more training data. What happens when all human knowledge has been mined?


Robots surrounded by speech bubbles: No input, no output.

(Image: Created with Midjourney by heise online)

This article was originally published in German and has been automatically translated.

Researcher Villalobos.

Pablo Villalobos is a staff researcher at the AI research institute Epoch AI in San Jose, California. He is particularly interested in the question of how AI models can be trained efficiently and what data is needed for this. In an interview with heise online, he talks about the phenomenon of "peak data", which concerns many AI companies.

heise online: The internet is huge, and seemingly countless amounts of information are added every day. Nevertheless, you warn of an information peak, i.e. the day when there is nothing left with which to train AI models. You compare this "peak data" with "peak oil", the point at which oil and gas production reaches its maximum and begins to decline. Isn't your analysis a little exaggerated?

Pablo Villalobos: Well, the analogy with peak oil may sound dramatic, as does the warning that we could run out of data. Nevertheless, the internet is reaching peak production. Ultimately, I think we have to expect a transition at a certain point, when AI models have learned most of what the internet can teach them. Then we need to find new sources of knowledge to further improve the models.

How far away are we from that moment? Is it even possible to track this development?

According to our current estimates, we are still a few years away from that moment, probably between two and six. Other researchers have made their own predictions about the amount of data on the internet and arrived at somewhat longer or shorter time frames. But as long as the amount of data used for training continues to triple every year, that moment will inevitably come. And I am sure that the AI companies themselves have precise estimates of how much data they can access and when it will no longer be enough for them.

OpenAI, Anthropic and other leading AI companies have stated that they could also use synthetic data for LLM training purposes, i.e. simply have LLMs generate it themselves. How does this work in practice?

Synthetic data is basically a very simple idea, modeled on how people generate new knowledge in mathematics, for example. We think hard about a problem, try out different approaches, discard the ones that don't work and keep the ones that do, until we have learned how to solve the problem. Then we train a model on the result and move on to the next problem.

There are many ways to do this. Basically, it would probably involve many instances of a model like GPT-4 reviewing and then curating text written by other instances. For example: several of these instances read a book and each write a review describing the strengths and weaknesses of the work. Then other instances rate these reviews and select the best ones, while still more instances provide feedback on these reviews. Finally, the models compile a thorough list of improvements and produce a new, improved version of the book.
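To make that generate-critique-select loop concrete, here is a minimal Python sketch of one way it could look. It is purely illustrative: the function `call_llm`, the prompts, and the keep-the-best-half selection rule are assumptions made for this example, not part of any real pipeline Villalobos describes.

```python
def call_llm(prompt: str) -> str:
    """Placeholder for a call to a language model API; plug in a real client here."""
    raise NotImplementedError

def synthesize_revision(source_text: str, n_reviews: int = 4) -> str:
    # 1. Several model instances each write a critical review of the source text.
    reviews = [
        call_llm(f"Describe the strengths and weaknesses of this text:\n{source_text}")
        for _ in range(n_reviews)
    ]

    # 2. Other instances rate each review; keep only the best-rated half.
    scored = [
        (float(call_llm(f"Rate this review from 0 to 10. Reply with a number only:\n{r}")), r)
        for r in reviews
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    best_reviews = [review for _, review in scored[: max(1, n_reviews // 2)]]

    # 3. A final instance turns the curated feedback into an improved version of the
    #    text, which can then serve as a synthetic training example.
    feedback = "\n\n".join(best_reviews)
    return call_llm(
        f"Rewrite the following text, addressing this feedback:\n{feedback}\n\nText:\n{source_text}"
    )
```

The point of the sketch is the curation step: raw generations never go straight into the training set; only output that has survived review and selection does.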

I imagine it's a bit like the Ouroboros, the mythical snake that eats its own tail. What about the problem of so-called model collapse, where models end up writing only nonsense because they have, in effect, been trained on themselves?

The approach described above is more complicated than simply having a model write down whatever comes into its head and training the next model on that. That's the price you have to pay if you want to avoid the degeneration you mention.

And it's true: a model trained directly on its own output is like a student grading his own exam right after taking it: at best, he learns nothing, and at worst, he reinforces the mistakes he made. In the approach described above, however, the procedure is more reminiscent of an expert criticizing his own and other experts' arguments and thereby advancing the field.

When does model collapse occur and when does it not?

There are several studies on this. They show that repeatedly training models on the raw, uncurated output of other models ultimately leads to degeneration.

But there are also counter-examples: AlphaZero, which became an expert at the game of Go by playing against itself, and AlphaGeometry, which learns to prove geometry theorems by learning from its past mistakes and successes.
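The common thread in these counter-examples is an external check on the model's output. As a rough illustration, here is a hedged Python sketch of self-play with built-in self-correction: candidate solutions only become training data if a verifier accepts them. The split into `propose` and `verify` and the function names are assumptions for this example, not how AlphaZero or AlphaGeometry are actually implemented.

```python
from typing import Callable

# Illustrative sketch: the model proposes solutions, an external checker verifies
# them, and only verified (problem, solution) pairs are kept as synthetic training data.

def self_improvement_round(
    problems: list[str],
    propose: Callable[[str], str],       # model: problem -> candidate solution
    verify: Callable[[str, str], bool],  # external checker, e.g. a symbolic prover
) -> list[tuple[str, str]]:
    training_pairs: list[tuple[str, str]] = []
    for problem in problems:
        candidate = propose(problem)
        # Failed attempts are discarded rather than fed back into the model,
        # which is what keeps the loop from collapsing into nonsense.
        if verify(problem, candidate):
            training_pairs.append((problem, candidate))
    return training_pairs
```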

For what it's worth, in practice I doubt that model collapse will be a really big obstacle. It's just a matter of finding the right combination of trial and error with built-in self-correction. But even that will be quite a lot of work.