AI training with synthetic data: "The internet is reaching its peak"



In fact, models that are trained on internet content are also increasingly likely to encounter AI-generated content, which is difficult to recognize.

The internet contains hundreds of trillions of words. OpenAI CEO Sam Altman said that OpenAI currently generates 100 billion words per day, or about 36 trillion words per year. Even if all of that ended up on the internet, it would currently amount to only a small percentage of the total text. But it may become a more significant share in a few years' time.
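As a rough illustration of the arithmetic above, here is a minimal back-of-the-envelope sketch; the 500-trillion-word total is an assumed midpoint for "hundreds of trillions" and not a figure from the interview.

```python
# Back-of-the-envelope check of the figures above. All numbers are rough
# illustrations; the total internet word count is an assumption.
WORDS_PER_DAY = 100e9            # Altman's figure: ~100 billion words per day
words_per_year = WORDS_PER_DAY * 365
print(f"AI-generated words per year: ~{words_per_year / 1e12:.1f} trillion")

ASSUMED_INTERNET_WORDS = 500e12  # assumed midpoint of "hundreds of trillions"
share = words_per_year / ASSUMED_INTERNET_WORDS
print(f"Share of the assumed total internet text: ~{share:.1%} per year")
```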

What's more, the data collected from the internet is cleaned before it is used for training. While we cannot reliably tell AI-generated data apart from human-written data, we can distinguish high-quality from low-quality data, for example repetitive content. So if AI-generated data is of good quality, it can be used for training. If it contains a lot of spam, however, it is filtered out and removed from the training dataset.
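A minimal sketch of what such repetition-based filtering could look like, assuming a simple pipeline that drops exact duplicates and spam-like documents regardless of whether a human or a model wrote them; this is an illustration, not any lab's actual cleaning code.

```python
# Minimal sketch of repetition-based filtering: drop exact duplicates and
# documents dominated by a single repeated trigram.
import hashlib
from collections import Counter

def is_repetitive(text: str, threshold: float = 0.3) -> bool:
    """Flag a document if one trigram makes up too large a share of it."""
    words = text.lower().split()
    trigrams = [tuple(words[i:i + 3]) for i in range(len(words) - 2)]
    if not trigrams:
        return False
    most_common_count = Counter(trigrams).most_common(1)[0][1]
    return most_common_count / len(trigrams) > threshold

def clean(documents: list[str]) -> list[str]:
    seen, kept = set(), []
    for doc in documents:
        digest = hashlib.sha256(doc.strip().lower().encode()).hexdigest()
        if digest in seen or is_repetitive(doc):
            continue  # filtered out: duplicate or spam-like repetition
        seen.add(digest)
        kept.append(doc)
    return kept

docs = ["Useful explanation of a topic, written once.",
        "Useful explanation of a topic, written once.",          # duplicate
        "buy now buy now buy now buy now buy now buy now buy"]   # spam-like
print(clean(docs))  # only the first document survives
```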

There is the time after humans started testing nuclear weapons, and there is the time before; the difference can be measured in the atmosphere. Can you compare that with the time before and after the launch of large language models?

Perhaps. If things continue like this, in a few years it may be possible to clearly identify the era of LLMs by the difference in power consumption.

Some AI researchers say that the era of ever-larger models is over and that we should instead develop smaller, more efficient models. Is that a possible solution?

Small and efficient models can definitely offer great added value, especially for simpler tasks. However, when it comes to overall performance, large models are currently unbeatable. And the human brain is still bigger than the biggest models we have, if you consider the parameters of AI models and the synapses of the brain as equivalent. So I assume that most applications will use smaller models in the future, but larger models will still be needed for more complex cognitive requirements.
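For a sense of scale, here is a rough order-of-magnitude comparison under the parameter-to-synapse analogy mentioned above; both figures are ballpark estimates (a commonly cited ~100 trillion synapses for the human brain, and an assumed roughly one trillion parameters for the largest current models), not exact numbers.

```python
# Rough order-of-magnitude comparison, treating one model parameter as
# loosely analogous to one synapse. Both values are ballpark estimates.
SYNAPSES_HUMAN_BRAIN = 1e14    # commonly cited estimate: ~100 trillion synapses
PARAMS_LARGEST_MODELS = 1e12   # assumption: roughly a trillion parameters

ratio = SYNAPSES_HUMAN_BRAIN / PARAMS_LARGEST_MODELS
print(f"The brain is still ~{ratio:.0f}x 'bigger' under this crude analogy.")
```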

You talk about the dream of artificial general intelligence (AGI) or even a superintelligence.

That still depends on scaling, yes. But it could also require synthetic data, or AI models that learn directly from the real world, for example through their own experiments. It might also take other new forms of learning to get there.

As far as the practical benefits of current LLMs and chatbots are concerned, some observers are now more skeptical than they were just a few months ago. How long will the hype last?

Every additional order of magnitude of scaling becomes a new experiment. Developing models on the scale of OpenAI's GPT already cost hundreds of millions of dollars at a time when LLMs were still practically useless and unknown to the general public. A few years later, they are generating billions in revenue for the company.

Now billions are being spent on developing the next generation. In a few years' time, we will see whether this new generation can generate revenue in the tens of billions. If not, the hype will probably cool down considerably. If it works, we will see yet another experiment, this time on a scale of 100 billion dollars, and the hype will increase tenfold again. (anw)