AI companies used subtitles from thousands of YouTube videos for model training

Tech giants such as Apple, Nvidia and Salesforce have snapped up YouTube subtitles to train their own AI. The video creators knew nothing about it.

(Image: man holding a transparent tablet with the YouTube logo; credit: metamorworks/Shutterstock.com)

By Frank Schräer
This article was originally published in German and has been automatically translated.

Training artificial intelligence, and language models in particular, requires enormous amounts of data so that the AI can give informed responses. This training data also includes subtitles from YouTube, as an independent investigation has now discovered. Not only does extracting the subtitles violate the video service's guidelines, it was also done without the knowledge or consent of the content creators who published these videos on YouTube.

This is not the first case: back in April, it was reported that ChatGPT had been trained with one million hours of YouTube videos. OpenAI is said to have gained a head start in language models through automated transcription. Now, major companies such as Apple, Anthropic, Nvidia and Salesforce have made things easier for themselves by tapping into the subtitles already available on YouTube, eliminating the need to convert speech to text.

The dataset used to train AI language models includes video transcripts of educational YouTube channels from MIT and Harvard University in the US, as well as the Wall Street Journal and the BBC. Subtitles from popular TV talk shows by Stephen Colbert, John Oliver and Jimmy Kimmel were also used, as well as from YouTube channels with millions of subscribers, reports Proof News. The data collection includes two videos from MrBeast, seven videos from Marques Brownlee and 337 videos from PewDiePie.

The YouTube subtitles belong to a dataset called "The Pile", which was compiled by AI researchers from EleutherAI for open-source language models. The Pile also contains documents from the European Parliament, Wikipedia texts and internal emails from the collapsed US company Enron. The subtitle collection was created by EleutherAI founder Sid Black, who used a script to retrieve the subtitles from the YouTube API, as he describes on GitHub. The dataset is often used by researchers and scientists for academic purposes.

But it's not just academics who use the Pile data collection. Apple and Nvidia describe how they use Pile for AI training in various published documents. In April, Apple introduced new local LLMs, including the new OpenELM (Open-source Efficient Language Models) model family. The documents show that Apple has trained OpenELM using Pile data.


Anthropic also confirms the use of Pile for training its own AI models, such as the AI assistant Claude. When asked, an Anthropic spokesperson explained that Pile contains a tiny selection of YouTube subtitles. "YouTube's terms cover direct use of the platform, which is distinct from use of the Pile dataset. As for possible violations of YouTube's Terms of Service, we must refer you to the authors of The Pile."

Current generative AI models deliver better results the more training material they have processed. Using YouTube subtitles, AI language models can learn human phrases, but also profanity. For example, Salesforce developers noticed that the Pile dataset contains curses and swear words, as well as "prejudices against gender and certain religious groups".

YouTube's terms of use prohibit access to videos in automated form, but so far, the video service has apparently not prevented the use of the EleutherAI script developed in 2020 to access the subtitles. It is not known whether Google uses the Pile data collection to train its own AI models such as Gemini, or whether the YouTube subtitles are used directly for these purposes.

In any case, the video creators are not pleased that parts of their content were taken without their being asked. "Nobody came to me and said: 'We'd like to use this'," says the operator of a political channel with over two million subscribers and more than two billion views. Another content creator describes it as theft and says it is disrespectful not to obtain consent, especially as there are signs that studios will use generative AI in the future to replace humans with artificial images.

(fds)