Google Real-Time Translator: More Than Word-for-Word Translations
Google's real-time translator looks ahead and anticipates what is being said, explains Niklas Blum, Director of Product Management.
Speaking English in the middle of a meeting while the other person hears the words in perfect Spanish – in real time. What was long considered science fiction is now something Google is rolling out in Meet and on Pixel devices. Behind the feature is the same AI that also powers Gemini – and a complex interplay of specialized translation models and generative language modules.
We spoke with Google's Director of Product Management, Niklas Blum, about the underlying technology: How does end-to-end speech translation that even preserves the original voice work? What role does the team in Germany play – and what are the biggest hurdles in translating spoken language?
With Google Meet, anyone can speak in a different language – in real time. This works using AI; specifically, it's the same translator that powers Gemini. How does it work? What happens in the model?
Currently, we use specialized models for translation and Gemini for speech generation. This architecture relies on the AudioLM framework as well as Transformer blocks and is designed to process continuous audio streams. This allows the model to decide independently when the translation is output. We recently published a technical research blog post that explains how this end-to-end Speech-to-Speech Translation (S2ST) works while preserving the original voice.
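For readers who want to picture the division of labor Blum describes, here is a minimal sketch in Python. It only mirrors the structure mentioned in the interview: a specialized streaming translation model feeding a generative speech model that preserves the speaker's voice. All class and method names (StreamingTranslator, VoicePreservingSynthesizer and so on) are hypothetical stand-ins, not Google's actual components or APIs.

```python
# Illustrative sketch of the two-stage pipeline described above: a specialized
# streaming translation model produces translated segments, and a separate
# generative speech model renders them in the original speaker's voice.
# All names are hypothetical stand-ins, not Google's actual components or APIs.
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass
class AudioChunk:
    samples: bytes      # raw audio for a short time slice
    timestamp_ms: int   # position of the chunk in the stream


class StreamingTranslator:
    """Specialized translation model: consumes source audio, emits translated segments."""

    def translate(self, chunks: Iterator[AudioChunk]) -> Iterator[str]:
        for chunk in chunks:
            # The real model decides on its own when enough context has
            # accumulated to emit the next translated segment.
            segment = self._maybe_emit(chunk)
            if segment is not None:
                yield segment

    def _maybe_emit(self, chunk: AudioChunk) -> Optional[str]:
        raise NotImplementedError  # placeholder for the actual model


class VoicePreservingSynthesizer:
    """Generative speech model: renders translated segments in the original voice."""

    def synthesize(self, segments: Iterator[str], voice_profile) -> Iterator[AudioChunk]:
        for segment in segments:
            yield self._render(segment, voice_profile)

    def _render(self, segment: str, voice_profile) -> AudioChunk:
        raise NotImplementedError  # placeholder for the actual model


def speech_to_speech(chunks: Iterator[AudioChunk], voice_profile) -> Iterator[AudioChunk]:
    """End-to-end S2ST: source audio in, translated audio in the same voice out."""
    return VoicePreservingSynthesizer().synthesize(
        StreamingTranslator().translate(chunks), voice_profile
    )
```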
How is it that a team from Germany is working on the real-time translator?
Our team and our customers are spread around the globe. The teams working on this feature are distributed across Berlin, Stockholm, Zurich, New York, and Mountain View. The Google team in Stockholm is a central hub for real-time communication at Google.
What can the real-time translator be used for so far? It's available in Google Meet, but where else is it used, and what is planned?
The real-time translation technology is also available on Pixel 10 devices for calls and in Google Translate. In Google Meet, we are specifically focusing on use cases for real-time conversations in companies that operate in different markets and deal with language barriers. We believe that this technology, even though it is still in its early stages, will develop rapidly. Real-time translations have the potential to connect people and enable conversations that were hardly possible before.
The translator “looks ahead”
Spoken language is more error-prone than written language. How does the model handle this? Does it translate one-to-one, including every "um" and, if need be, an unfinished sentence? That is, after all, how we sometimes speak. Or does the real-time translator draw its own conclusions and essentially clean up the language?
Our real-time translation model uses Transformer blocks and consists of two main components: a streaming encoder that summarizes the source audio data based on the preceding ten seconds of input, and a streaming decoder. The latter autoregressively predicts the translated audio, using the compressed encoder state and predictions from previous iterations.
The Transformer blocks allow the model to independently decide when to output the translation. Based on the training data, the model is capable of going beyond pure word-for-word translations. This is particularly helpful with idioms or with recognizing proper nouns: terms like the "Golden Gate Bridge" are not translated.
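The data flow Blum outlines – a rolling ten-second encoder context plus an autoregressive decoder that conditions on its own earlier outputs – can be sketched roughly as follows. The model calls themselves are stand-ins; only the loop structure follows the description in the interview, and the chunk size is an assumed value.

```python
# Toy sketch of the streaming encoder/decoder loop described above. The `encode`
# and `decode_step` callables stand in for the actual models; the chunk size is
# an assumed value, only the ten-second context window comes from the interview.
from collections import deque

CHUNK_MS = 200                                   # assumed length of one incoming audio slice
CONTEXT_MS = 10_000                              # encoder sees the preceding ten seconds
MAX_CONTEXT_CHUNKS = CONTEXT_MS // CHUNK_MS


def run_streaming_s2st(audio_chunks, encode, decode_step):
    """encode(chunks) -> compressed encoder state
    decode_step(state, previous_outputs) -> next translated audio frame, or None to wait."""
    context = deque(maxlen=MAX_CONTEXT_CHUNKS)   # rolling ten-second window
    outputs = []                                 # decoder's own past predictions

    for chunk in audio_chunks:
        context.append(chunk)                    # older chunks fall out automatically
        state = encode(list(context))            # summarize the current window

        # The decoder decides on its own whether to emit audio now;
        # returning None means "wait for more context".
        frame = decode_step(state, outputs)
        while frame is not None:
            outputs.append(frame)
            yield frame
            frame = decode_step(state, outputs)
```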
What is the biggest difficulty in translating spoken language? Where does it sometimes still falter?
In translating spoken language, three demands compete: we want the highest possible translation quality and minimal delay while also preserving the original voice characteristics. For real-time conversations, a standard delay of two seconds is currently used, which works well for most languages. A longer "lookahead" by the model would improve translation quality by providing additional context, but it impairs the real-time experience. Achieving optimal translation quality in the shortest possible time remains the central challenge and an area for further improvement.
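A rough back-of-the-envelope view of that trade-off: the longer the model's lookahead, the more context it has, but every extra second of lookahead is a second the listener waits. Apart from the roughly two-second default Blum mentions, the figures below are illustrative assumptions.

```python
# Rough illustration of the latency/quality trade-off: every second of lookahead
# adds a second of perceived delay. Chunk and compute times are assumed values;
# only the ~2-second default delay comes from the interview.

def end_to_end_delay(lookahead_s: float, chunk_s: float = 0.2, compute_s: float = 0.3) -> float:
    """Approximate delay before a listener hears a translated word: the model waits
    for `lookahead_s` of future audio, plus one chunk of buffering and inference time."""
    return lookahead_s + chunk_s + compute_s


for lookahead in (0.5, 1.5, 3.0):
    print(f"lookahead {lookahead:.1f} s -> ~{end_to_end_delay(lookahead):.1f} s perceived delay")
```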
In general, AI audio processing and model quality have made great leaps recently. This is likely why speech translation is increasingly being integrated into products across the industry.
There was a time when Google and other providers did not release similar translation tools because of the threat of misuse. What has changed?
We are now integrating this feature into our products because the technology has made a huge leap forward. I believe that until recently, it was not possible to develop truly high-quality, dialogue-oriented services that meet the required standards.
What about the dangers of misuse, of deepfakes? What protective measures are there?
We are, of course, obligated to comply with applicable data protection laws. Over the years, we have worked closely with data protection authorities around the world and implemented strict data protection measures. For example, we have clear guidelines for Meet on how our tool may be used. Users are not permitted to use Meet to impersonate another person, for example.
Technically, the translation function works similarly to existing audio encoding, just with translation as an additional step. Every sound sent to the model generates an output. The model operates with a 10-second context window and has no semantic awareness of anything spoken outside this window.
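Put concretely, the fixed window means the model only ever "sees" the most recent ten seconds of the call; anything said earlier has simply dropped out of its context. A small illustration, with the window size taken from the interview and everything else hypothetical:

```python
# Illustration of the fixed ten-second context window described above: at any
# moment the model only has access to the most recent ten seconds of audio.
WINDOW_S = 10.0


def visible_span(current_time_s: float) -> tuple[float, float]:
    """Time span (start, end) of the audio currently available to the model."""
    return (max(0.0, current_time_s - WINDOW_S), current_time_s)


# A name mentioned at second 5 of the call is still in context at second 12 ...
assert visible_span(12.0) == (2.0, 12.0)
# ... but has fallen out of the window by second 20.
assert visible_span(20.0) == (10.0, 20.0)
```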
(emw)