Nvidia: Court documents reveal correspondence regarding pirated dataset

US tech giant Nvidia has approached the shadow library Anna’s Archive and negotiated access to millions of pirated copies.

By Robin Ahrens

The US tech giant Nvidia is said to have contacted the archive project Anna’s Archive to gain access to millions of pirated books. This is evident from court documents first published by the blog Torrentfreak. According to the documents filed as part of an amended complaint in the U.S. District Court for the Northern District of California, a member of Nvidia’s data strategy team directly approached Anna’s Archive. The terms for particularly fast access to around 500 terabytes of data from the shadow library were reportedly discussed.

The background to the now-published internal information is a class-action lawsuit filed in January 2024 by three US authors against Nvidia. They accuse the graphics processor manufacturer of using their copyrighted works to train its in-house AI models, such as the NeMo framework, without permission and are demanding compensation. The authors’ affected works were part of the Books3 dataset, comprising over 196,000 books, from the shadow library Bibliotik. Other authors have since joined the original plaintiffs, and potentially hundreds more could follow.

Last Friday, the plaintiffs filed an amended complaint with the district court in California, which includes explosive correspondence between an employee of Nvidia’s data strategy team and Anna’s Archive. The email exchanges cited by Torrentfreak show that Nvidia specifically contacted the shadow library to enable the integration of its content into the training data of Nvidia’s own Large Language Models (LLMs).

Anna’s Archive reportedly demanded more than 10,000 US dollars for so-called express access to the hosted data, whereupon Nvidia asked how exactly such accelerated access would work. Those responsible for the shadow library also informed Nvidia that the requested datasets had been illegally acquired and maintained, and therefore asked whether there was internal authorization. Nvidia reportedly gave this approval within a week, after which the shadow library granted access to the approximately 500 terabytes of pirated books. The court documents do not reveal whether Nvidia actually paid for access to the data.

According to Torrentfreak, this is the first time that email exchanges between a major US tech company like Nvidia and Anna’s Archive have been published.

In the amended complaint, Nvidia is accused of having downloaded and used data from the shadow libraries LibGen, Sci-Hub, and Z-Library for LLM training, in addition to the Books3 dataset. Furthermore, Nvidia is alleged to have distributed scripts and tools that enabled corporate customers to download “The Pile,” an open-source dataset of more than 886 gigabytes used for training LLMs. In addition to public domain works, the corpus also contains the pirated Books3 dataset.

The lawsuit against Nvidia is not the first of its kind. The New York Times has already sued OpenAI, alleging that ChatGPT, the company’s AI-powered chatbot, reproduced the newspaper’s copyrighted content verbatim. The newspaper has since filed another lawsuit, this time against the AI search engine Perplexity. In Germany, GEMA has won against OpenAI in the first instance.

(rah)

This article was originally published in German. It was translated with technical assistance and editorially reviewed before publication.