Human Rights Watch: Photos of children from Brazil in AI training data

The LAION-5B AI data set contains a large number of photos of Brazilian children, collected without their consent. Human Rights Watch criticizes this.

Photos of newborns and children are included in the LAION-5B data set along with names, places and dates.

(Image: paulaphoto/Shutterstock.com)
This article was originally published in German and has been automatically translated.

An approximately two-year-old child touches the fingers of her newborn sister. This picture is included in the data set known as LAION-5B, together with the names of the two girls and the hospital where the photo was taken. Human Rights Watch has found around 170 photos of children from Brazil in the data set, which is used, among other things, to train AI models. According to the organization, however, this is only a fraction of such photos. It criticizes the fact that the children never consented to this use and warns that the images could be misused.

"Children should not have to fear that their photos can be stolen and used against them," says Hye Jung Han, children's rights lawyer at Human Rights Watch. In a blog post, she calls on governments to enact laws as quickly as possible to protect children's data from AI misuse.

LAION-5B is just one of numerous data sets used for AI training. Content is scraped from the internet, i.e. collected and processed; unwanted and criminal content, for example, is sorted out and flagged, often by low-paid workers. The extent to which such processing requires consent is regulated differently around the world, or remains unclear. On the one hand, there is the question of copyright in the data; on the other, data protection and the processing of personal data.

Human Rights Watch analyzed only 0.0001 percent of the 5.85 billion images contained in LAION-5B, including captions. The researchers also found pictures of births, birthdays and children dancing in their underwear. According to the activists, many of these photos were originally visible only to a small circle of people and could not be found via a search engine. Some of the images were uploaded many years ago, years before LAION-5B and concerns about AI applications even existed. AI models trained on the photos can reproduce them one-to-one or in similar form.

LAION is a German non-profit organization. It has announced that it will delete all known content of this kind from its data sets. According to Human Rights Watch, the association also takes the position that children and their legal guardians are responsible for removing personal photos of children from the internet, this being the most effective protection against misuse.

Numerous website operators are now trying to exclude crawlers from their sites in order to protect their content. Meta, for example, collects images and posts itself in order to train its own AI models, and is currently obtaining permission for this by amending its privacy policy. Consumer and data protection advocates are criticizing this approach and calling for it to be stopped.
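Excluding crawlers typically works via a site's robots.txt file, which names crawlers and the paths they may not fetch. As a minimal sketch, the following Python snippet uses the standard library's `urllib.robotparser` to check a hypothetical robots.txt (the site is made up; "GPTBot" and "CCBot" are the real user-agent tokens of OpenAI's and Common Crawl's crawlers). Note that robots.txt is purely advisory: a crawler must choose to honor it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks two known AI crawlers
# while allowing everyone else.
robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# The blocked AI crawlers may not fetch anything:
print(parser.can_fetch("GPTBot", "https://example.org/photos/1.jpg"))  # False
# A generic client falls under the "*" rule and is allowed:
print(parser.can_fetch("SomeBrowser", "https://example.org/photos/1.jpg"))  # True
```

Whether a given scraper respects such rules is exactly the point of contention: robots.txt offers no technical enforcement, which is why operators and regulators are pushing for legal safeguards instead.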

Google also says it uses all available content from the internet. OpenAI is mostly silent about the origin of its training data. However, CTO Mira Murati has said that all freely available data was used for the Sora video AI, including data from the Meta platforms Facebook and Instagram. She said she was not sure about YouTube. Google objected that if OpenAI had used videos from the platform, this would violate the terms of use. In order to be able to continue using articles, OpenAI has concluded several contracts with publishers. The New York Times prominently complained that OpenAI had used its copyrighted articles without permission.

(emw)