An alternative to Google's NotebookLM Audio Overviews: Meta launches NotebookLlama
After Google's Audio Overviews in NotebookLM caused a sensation online, Meta is now presenting an open-source competitor. It doesn't sound quite as smooth yet.
After Google added the Audio Overviews feature to its AI-powered notebook NotebookLM in September, it spread rapidly on X and other social media. Numerous users wanted to try out how they could create their own realistic-sounding podcast from just a PDF or a single URL, and later from a YouTube video with a transcript. Some people have their physics paper explained to them; journalists, their credit card bill.
Now Google's two AI presenters are facing competition from Meta: the Facebook parent company has unveiled its own podcast generator. It goes by the name NotebookLlama and is based on Meta's own language model Llama-3.1-70B, and it includes speech generation. In contrast to Google's Audio Overviews, the Zuckerberg version is open source, and the code is already available on GitHub. It can therefore serve as a starting point for your own developments. A more accurate name would be "Audio Overview Llama", as it only covers podcast generation and not the entire NotebookLM range of functions. The results that NotebookLlama delivers are still comparatively weak. In contrast to the Google version, the voices often seem unnatural, there are audible artifacts, and the two presenters, a woman and a man by default, never really get going. There is a lack of emotion, and the intonation sometimes sounds "off".
PDF or printed website as input
For now, PDFs serve as input and are converted into plain text. If you want to use a website as input, you have to save it as a PDF first. NotebookLM regularly runs into a similar limitation, however, as it only accepts websites that do not block Google's AI crawler. Llama-3.1-70B then generates a script for the podcast, which Llama-3.1-8B in turn reworks into a more human-sounding dialog. Finally, the audio is generated using Parler-TTS and Suno.
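The four stages described above can be pictured as a simple chain of functions. The following is a minimal Python sketch, not Meta's actual code: the class `PodcastPipeline` and all stage functions are hypothetical names chosen for illustration, and the stand-in stages only return placeholder strings so that the example runs without a model or GPU. In a real setup, each stand-in would be replaced by a call to the respective model.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Turn = Tuple[str, str]  # (speaker label, spoken line)


@dataclass
class PodcastPipeline:
    """Four-stage pipeline; every stage is a hypothetical, swappable callable."""
    extract_text: Callable[[str], str]           # PDF path -> plain text
    write_script: Callable[[str], str]           # larger model drafts the script
    rewrite_dialog: Callable[[str], List[Turn]]  # smaller model makes it conversational
    synthesize: Callable[[Turn], bytes]          # text-to-speech for a single turn

    def run(self, pdf_path: str) -> bytes:
        text = self.extract_text(pdf_path)
        draft = self.write_script(text)
        turns = self.rewrite_dialog(draft)
        # Join the per-turn output in speaking order. Real audio clips would
        # need proper audio concatenation instead of a plain byte join.
        return b"".join(self.synthesize(turn) for turn in turns)


# Stand-in stages (assumptions for illustration only, no models involved).
def fake_extract(pdf_path: str) -> str:
    return f"contents of {pdf_path}"


def fake_write_script(text: str) -> str:
    return f"Draft script about: {text}"


def fake_rewrite(draft: str) -> List[Turn]:
    return [("Speaker 1", draft), ("Speaker 2", "Interesting, tell me more.")]


def fake_tts(turn: Turn) -> bytes:
    speaker, line = turn
    return f"[{speaker}] {line}\n".encode("utf-8")


pipeline = PodcastPipeline(fake_extract, fake_write_script, fake_rewrite, fake_tts)
audio = pipeline.run("paper.pdf")
print(audio.decode("utf-8"))
```

Because each stage is passed in as a plain function, the model used for the script or the dialog rewrite could be swapped for a smaller one without touching the rest of the chain.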
In theory, it is also possible to run NotebookLlama on your own computer. However, according to the creators, you should then use Llama 8B or a smaller model for the entire pipeline. Otherwise, a GPU server or an API provider that offers Llama is mandatory. As is typical for AI, the requirements are high: running the 70B model requires GPUs with about 140 GB of aggregated memory (precision: bfloat16). The researchers at Meta, including Vikas Sharma, admit that their project still has some catching up to do. Currently, the text-to-speech model sounds rather robotic. "That's the limitation of how natural [the output] sounds." In addition, the script could be more exciting if it were written by two agents debating with each other. "Currently, we only use a single model to write the podcast outline."
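The 140 GB figure can be checked with a rough calculation: bfloat16 stores each parameter in 2 bytes, so 70 billion parameters take up about 140 GB for the weights alone. The short Python sketch below illustrates this; the function name `weight_memory_gb` is made up for this example, and the estimate deliberately ignores additional memory for activations and the key-value cache.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for the model weights alone, in decimal gigabytes.

    bytes_per_param defaults to 2, the storage size of one bfloat16 value.
    """
    return num_params * bytes_per_param / 1e9


print(weight_memory_gb(70e9))  # 70B model in bfloat16 -> 140.0
print(weight_memory_gb(8e9))   # 8B model in bfloat16  -> 16.0
```

By the same estimate, the 8B model needs around 16 GB for its weights, which makes it far more practical to run on a single machine.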
Google has brought actors into the studio
Google also brought in experts for Audio Overviews. These include bestselling author Steven Berlin Johnson, who is the creative director and comes from the content industry. NotebookLM project manager Raiza Martin also told heise online that the two AI podcast presenters do not use purely artificial voices; instead, Google brought voice actors into the studio. In the future, NotebookLM is meant to earn money with a business offering, for which a preview phase was recently launched for selected testers. In addition, users can now partially customize the Audio Overviews via a prompt.
NotebookLlama is not the first attempt to copy Google's podcast generator. The Open NotebookLM project is also open source and uses Meta's Llama 3.1 and MeloTTS. However, testers complain that the software has a greater tendency to hallucinate than Google's original. NotebookLM and Audio Overviews try to get around this problem by ensuring that the output always stays as close as possible to the source material; the model's world knowledge is of secondary importance. However, errors also occur with the Audio Overviews. Machine learning expert Iwona Bialynicka-Birula fed her 2008 doctoral thesis into it back in September and found that the podcast was full of "nonsensical analogies" and repetitions "in 1000 different ways".
(mho)