LLaMA clone: RedPajama – first open-source decentralized AI with open dataset

RedPajama has reproduced LLaMA's training dataset of over 1.2 trillion tokens and is making it open-source – kicking off a decentralized AI project for LLMs.


The mascot is based on a nursery rhyme by Anna Dewdney: "Llama Llama Red Pajama."

(Image: Together)

By Silke Hahn

(This article is also available in German.)

The LLaMA training dataset with over 1.2 trillion tokens has been reproduced and is open source: The RedPajama project says it's building a series of open-source large foundation models to counter black box models such as GPT-4. According to a recent blog post, RedPajama has completed the reproduction of the LLaMA dataset and is making it freely available to the public.

RedPajama is an alliance of top-class researchers from Canadian universities (Mila Québec, Université de Montréal), several research institutes at Stanford University (Stanford CRFM – Center for Research on Foundation Models; Hazy Research at the Stanford AI Lab), TogetherCompute, LAION, EleutherAI, and other partners who are pooling their expertise, research, and hardware resources for the project. RedPajama has set three goals, according to the blog post:

  • Pre-training data, which needs to be both high quality and have broad coverage
  • Instruction tuning data and models, which improve the base model to make it usable and safe
  • Base models, which are trained at scale on this data

The project has now completed the first step with the release of the foundation dataset.

The most powerful foundation models are currently locked behind the APIs of commercial providers such as OpenAI, writes decentralized AI cloud provider Together on behalf of the project stakeholders. Restricted access precludes independent exploration of such models, personalization to divergent user needs, and their use with sensitive or confidential data.

There are already approaches to openly replicating large AI models, but so far they do not offer the same quality and performance as commercial models. The grassroots AI collective EleutherAI presented the Pythia series, on which Databricks' Dolly 2.0 and others are based, and LAION's OpenAssistant project, led by Andreas Köpf and Yannic Kilcher, published a free model including a high-quality open-source dataset. That dataset was created by crowdsourcing volunteers (human-generated) and went through in-depth review and moderation processes. Various models such as Pythia-12B, but also LLaMA, served as starting points here – the LLaMA-based model versions cannot be published due to unresolved licensing issues.

Meerkat dashboard for exploring the GitHub subset of the corpus. The screenshot shows a preview.

(Image: Hazy Research (Meerkat repository))

Offshoots of the LLaMA model, which offers restricted access for researchers and whose weights were leaked on BitTorrent, have the disadvantage of existing in a legal gray zone, since Meta AI has not released LLaMA under an open-source license. Only selected research projects can gain legal access upon request. The resulting models are neither open source nor suitable for commercial use. Since then, a number of semi-open models have been circulating on the Internet: in addition to LLaMA, these include Alpaca (Stanford University), Vicuna, LLaVA, and Koala (UC Berkeley). In addition, numerous offshoots have used the OpenAI API to generate synthetic training datasets, in violation of the US vendor's terms of use.

OpenAI prohibits using its products to create competing products and reserves the right to take legal action against such projects. It is becoming apparent that this is no paper tiger and is likely to be litigated in the future: Microsoft, for example, has begun penalizing customers who develop potential competitors to GPT-4 by threatening to restrict their access to Bing search data. Microsoft is OpenAI's largest funder and major investor, with exclusive rights to use its models.

RedPajama starts as a project with the goal of creating fully open and reproducible foundation models that can compete with the world's best in terms of capabilities. In addition to the Canadian and US research institutions mentioned (Mila Québec, Université de Montréal, Stanford Center for Research on Foundation Models) and the open-source AI associations (LAION, EleutherAI), Ontocord.AI is also involved as a partner – a specialist in creating training datasets for large foundation models with several billion parameters.

The starting point for the project was apparently the research paper on LLaMA, as its dataset is considered particularly comprehensive, high quality, and well filtered. In addition, a model with 7 billion parameters (like the smallest LLaMA variant) can be run on most GPUs, which matters to an open-source community with limited resources. Since existing offshoots such as Alpaca, Vicuna, and Koala are only available for research purposes, RedPajama's goal is a fully reproducible open-source replica of LLaMA that is also open to commercial applications. Alongside this, the project intends to provide researchers with a more transparent pipeline for training large-scale AI models.

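To illustrate why a 7-billion-parameter model is a practical size for a community with limited hardware, here is a quick back-of-the-envelope calculation of the memory the raw weights occupy at different numeric precisions. This is an illustrative sketch added for context, not a figure from the RedPajama announcement.

```python
# Illustrative arithmetic: approximate memory footprint of the weights of a
# 7-billion-parameter model at common precisions (activations, optimizer
# state, and KV cache would add to this).
params = 7_000_000_000

for label, bytes_per_param in [("fp32", 4), ("fp16/bf16", 2), ("int8", 1), ("4-bit", 0.5)]:
    gib = params * bytes_per_param / 1024**3
    print(f"{label:>9}: ~{gib:.1f} GiB for the weights alone")

# fp16 weights come out to roughly 13 GiB, which is why 7B models are the
# sweet spot for running on a single consumer or prosumer GPU.
```
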
The base dataset is available compressed in two sizes in a Hugging Face repository. It consists of seven different data sources:

  • Common Crawl (as per the Common Crawl Foundation terms of use)
  • Colossal Clean Crawled Corpus: C4 (as per the C4 license)
  • GitHub (MIT, BSD, and Apache licenses only)
  • arXiv papers (as per the arXiv terms of use)
  • Books (as per the the_pile_books3 license and the pg19 license)
  • Wikipedia (as per the Wikipedia license)
  • StackExchange (as per the Internet Archive license)

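For readers who want to inspect the corpus without downloading terabytes, the sketch below streams a few records with the Hugging Face datasets library. The dataset id and the record field names are assumptions based on the public repository and may differ, so check the dataset card before running.

```python
# Minimal sketch, assuming the sample dataset id
# "togethercomputer/RedPajama-Data-1T-Sample" and record fields "text" and
# "meta" -- verify both on the Hugging Face dataset card. Depending on your
# `datasets` version you may also need trust_remote_code=True.
from datasets import load_dataset

sample = load_dataset(
    "togethercomputer/RedPajama-Data-1T-Sample",
    split="train",
    streaming=True,  # stream instead of downloading the whole corpus
)

for i, record in enumerate(sample):
    # Each record is expected to carry the document text plus metadata
    # identifying its source (Common Crawl, C4, GitHub, arXiv, ...).
    print(record.get("meta"), str(record.get("text", ""))[:200])
    if i >= 2:
        break
```
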
Tokens from RedPajama and LLaMA compared: RedPajama's training dataset is roughly the same size as the one reported by Meta AI in the LLaMA paper. The values given for LLaMA are estimates based on the figures in the research paper published on arXiv.org.

(Image: TogetherCompute)

Common Crawl, the freely available crawl of Internet data, makes up the lion's share with 878 billion tokens. C4 (Colossal Clean Crawled Corpus) is a heavily filtered standard dataset generated by Google containing 175 billion tokens. The Washington Post, together with the Allen Institute, conducted a meticulous analysis of the 15 million web pages that feed into C4 and found the copyright symbol in them about 200 million times. Pirate sites that make copyrighted material freely available are said to be included in the dataset, and U.S. news sites in particular are grazed extensively for C4. The Colossal Clean Crawled Corpus is also the subject of an independent scientific study by Margaret Mitchell and researchers at the Allen Institute.

59 billion tokens come from GitHub (the data is filtered by license and quality). Scientific articles from arXiv.org (28 billion tokens) are included, processed to reduce repetition. For books, a corpus of open-access books contributed 26 billion tokens, which the team de-duplicated to avoid bias. Wikipedia contributed 24 billion tokens (a subset of Wikipedia pages went into the training), and StackExchange provided 20 billion tokens from a sub-dataset of popular sites there. Duplicates were removed.

At least two of the data sources used come with the caveat that they may infringe copyrights, as one copyright lawyer pointed out on Twitter: Common Crawl and the book collection "The Pile"; the C4 dataset may raise concerns as well. Vendors such as OpenAI evade such scrutiny by no longer specifying what training data they used – most recently for GPT-4. More detailed information on data preparation and quality filters can be found in the project's GitHub repository. The recipes for preparing the RedPajama data can be reproduced. This is significant because collecting and cleaning data can account for up to 90 percent of the effort in a machine learning project that uses real-world data (rather than synthetically distilled data).

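To make the cleaning effort concrete, here is a minimal, hypothetical sketch of one step such a pipeline typically performs: exact deduplication of documents by hashing their text. It illustrates the general idea only and is not the RedPajama recipe itself; the actual filters in the GitHub repository are considerably more involved.

```python
# Hypothetical illustration of one data-cleaning step: exact deduplication
# by content hash. Not the RedPajama pipeline, just the general technique.
import hashlib

def deduplicate(documents):
    """Yield each document only the first time its exact text is seen."""
    seen = set()
    for text in documents:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield text

docs = ["same page", "same page", "another page"]
print(list(deduplicate(docs)))  # -> ['same page', 'another page']
```
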
The next step in the project, according to the roadmap, is to train a strong base model. To that end, RedPajama is part of the U.S. INCITE program (with access to supercomputers at the U.S. Department of Energy's Argonne Leadership Computing Facility) and is receiving support from the Oak Ridge Leadership Computing Facility (OLCF), which is also funded by the U.S. Department of Energy (DOE). It is foreseeable that RedPajama's release of the training dataset and, in the future, of open models will be followed by a new wave of LLM offshoots, this time open source instead of gray area. RedPajama is the beginning of a large open-source, decentralized AI project. The first models are expected to appear "in the coming weeks."

The RedPajama announcement can be found on Together's blog. The dataset can be downloaded from Hugging Face. The data to reproduce the results is available under Apache 2.0 license on GitHub. Those who want to actively participate in the project can join the RedPajama Discord.

Update

Washington Post article on the C4 analysis added.

(sih)