Google's Audio Overviews: The podcast that doesn't want to be a podcast
The Audio Overviews feature of the Google Labs product NotebookLM is fascinating the web. Product Lead Raiza Martin talks about the background in an interview.
Podcast made of zeros and ones (symbolic image).
(Image: ymgerman/shutterstock.com)
NotebookLM is an application with a clearly defined target group: students, book authors and other knowledge workers can use the so-far free Google product, which received little attention for a long time, to manage large amounts of research material, analyze it with AI and ask questions via a Gemini-based chatbot.
However, since the Audio Overviews feature was released in September, the hype has been unstoppable: it can produce an audio show from a single document – from a boring PDF to a credit card statement to an entire book – in which two AI presenters discuss it. Most recently, the option was added to steer the audio overviews in a specific direction using your own prompts, and Google is also planning to turn NotebookLM into a business product, with a preview phase starting soon.
When the audio overviews appeared, social media quickly filled with examples that amazed listeners. The feature was also discussed on heise online in the #heiseshow and in a commentary. But what exactly is behind Audio Overviews, and how does Google approach it technically? Raiza Martin, the responsible Product Lead at NotebookLM, answered our questions in an e-mail interview.
(Image: private)
heise online: The Audio Overviews voices sound very natural. Have they been created entirely synthetically, or are they based on the recordings of professional voice actors?
Raiza Martin: The voices are actually based on those of voice actors. Google has a long tradition of hiring people and licensing their voices.
The output is very podcast-like, reminding me of classic programs such as those from the US broadcaster NPR. What exactly has the system been trained to do?
We don't currently give out details about the specific training data we used for our audio model, but the audio overviews are designed to make the content as interesting and accessible as possible.
Our team also made some editorial decisions to make it an engaging listening experience. Elements such as banter between the presenters, storytelling and question-and-answer formats are incorporated. It is also important to emphasize that NotebookLM is not trained on data uploaded by users.
Can you explain the different stages of a "production" when Audio Overview is presented with source material such as a PDF?
We're not naming these steps publicly right now, but we rely heavily on Gemini 1.5 Pro to create an authentic, natural-sounding conversation that users will find engaging.
Your colleague Steven Berlin Johnson, who is part of the NotebookLM team and is himself a bestselling author, spoke in an interview about a so-called disfluency step that the audio production goes through at the end. Can you explain what that means?
Ah, that's a typical "Steven" term! We do a lot to make the audio overview conversations feel natural, and we've found that this includes paying special attention to creating areas of the conversation that a more traditional text-to-speech system would consider imperfect.
A natural conversation is full of subtle digressions, pauses and even non-words, but we think these are important for finding the right pace for listeners who are first learning about a new topic.
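Google does not disclose how its disfluency step works, but the idea of deliberately adding the small imperfections of natural speech before text-to-speech rendering can be sketched in a few lines. The following toy example is purely illustrative (the filler list, rate and function name are our assumptions, not Google's pipeline):

```python
import random

# Illustrative filler tokens a disfluency pass might sprinkle in.
FILLERS = ["um,", "you know,", "I mean,", "right,"]

def add_disfluencies(turn: str, rate: float = 0.15, seed: int = 42) -> str:
    """Insert filler words between some words of a dialogue turn.

    A toy sketch of the concept from the interview: text that a
    traditional TTS system would consider "imperfect" is added on
    purpose so the rendered speech sounds more human.
    """
    rng = random.Random(seed)  # fixed seed for reproducible output
    words = turn.split()
    out = []
    for i, word in enumerate(words):
        out.append(word)
        # Never append a filler after the final word of the turn.
        if i < len(words) - 1 and rng.random() < rate:
            out.append(rng.choice(FILLERS))
    return " ".join(out)

clean = "The report shows a clear drop in costs over the last quarter"
print(add_disfluencies(clean))
```

In a real pipeline, such a pass would more likely be handled by the language model itself when drafting the script, rather than by random insertion.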
I've experimented a lot with Audio Overview myself, as has probably half the internet. The generated shows are usually astonishingly insightful about the input material and reach surprising conclusions. Is this all the work of Gemini Pro?
Gemini 1.5 Pro is the "workhorse" behind NotebookLM, and we also use other models to bring the Audio Overviews to life. It's important to note that we refer to the feature itself as a "listening overview" and not a "podcast".
We see it as a personalized tool for understanding information from user-supplied content. It is not about creating content of general interest, as a podcast is.
If a link to a website is used as input instead of a PDF file or text, does Audio Overview follow the links on that website? Or does it only use the material that is already in the model?
No, Audio Overview within NotebookLM does not follow any links on an uploaded website. It only examines the content of the source page you provide.
How much pure programming was required for Audio Overview? Is Gemini Pro's careful prompting almost as important as the coding part?
The magic of our product comes from the combination of the powerful underlying features of the models plus the clever application of those features. This includes prompting as a crucial element, but also many others.
Will there be more voices in the future? And also the option to use your own?
We listen very carefully to user feedback and are actively working on improving the overall user experience. However, we can't reveal anything concrete at this stage.
How has the podcasting industry reacted to the feature?
Audio Overview is a tool for better understanding information in sources that the user gives us. We see great value in generating an audio discussion, whether the input is a long email thread from work, notes from a community meeting or your own CV, whose impressive achievements the AI presenters then discuss. None of this would ever be a real podcast.
That's why our listening digests remain a unique way to explore personal source information, but not a substitute for podcasts.
Some observers felt that Google had its own ChatGPT moment with Audio Overviews – a moment when users realized that a fundamental shift was happening because the technology was so amazing. The origin story is interesting, too: the feature emerged as an add-on in a rather obscure tool like NotebookLM. Was it a good thing that the tool had time to mature inside a giant corporation like Google?
That's what Google Labs is all about – testing and developing new ideas and products. We're obsessed with solving problems that frustrate people and developing products and tools to do that. In the Labs, we have the space to follow our curiosity and experiment with new product concepts to get there.
It's clear that many great products once started small. But for a big company like Google, with its hugely successful products and business units, it's genuinely hard to do small things.
In the first demo of the Audio Overviews at Google I/O in May, the most important "pop feature" was the ability to interact directly with the AI presenters. Will this really come, and will it work in real time? So far, generating an audio overview takes about five minutes.
We are actively working on new features for our users and look forward to telling you more about them in due course.
How expensive is a product like Audio Overviews in terms of server performance? Could a start-up company implement something like this or did it have to be Google?
I'm afraid I can't give you any details on that.
NotebookLM, which is the basis of Audio Overviews, sometimes does not allow the import of content due to "source restrictions". What are these? Websites that cannot be indexed by Google's AI bot?
This can be triggered by content behind a paywall. We also respect the common practice of excluding websites that have opted out of being crawled.
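The opt-out Martin mentions is typically expressed via a site's robots.txt. We don't know NotebookLM's exact ingestion check, but a sketch of how such an opt-out is evaluated with Python's standard library looks like this (the robots.txt content and URLs are hypothetical; Google-Extended is the real user-agent token Google documents for its AI crawling opt-out):

```python
from urllib import robotparser

# Hypothetical robots.txt: the site opts out of Google's AI crawler
# while allowing everyone else.
ROBOTS_TXT = """\
User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""

parser = robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# An ingestion step could consult this before importing a page.
print(parser.can_fetch("Google-Extended", "https://example.com/article"))
print(parser.can_fetch("OtherBot", "https://example.com/article"))
```

A source that disallows the AI user agent would then be rejected with a "source restriction" style error instead of being imported.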
(emw)