Like 2021 with LLMs: Google researchers on the future of world models

In an interview, Google's Genie researchers discuss real-time simulation, the integration of Street View, and why robots need a world model.

Demonstration of Project Genie at Google I/O

(Image: heise online / Malte Kirchner)

World models: the name itself sounds monumental. With real-time AI models like Project Genie, Google goes far beyond what photo and video AI models can do. Those deliver snapshots, not a replica of the world that adapts dynamically to interactions. Many will first think of future game worlds that let anyone create the game they want on command. The research team, however, is primarily concerned with something else: applications in robotics, or a simulator for disaster scenarios.

At the Google I/O developer conference, Google announced that the 3D world generator is being extended with real locations from Street View. In an interview with heise online, research lead Jack Parker-Holder and Diego Rivas, Group Product Manager at Google DeepMind, explained the current status of the model.

The approach sounds simple but is technically demanding: Genie learns how a world changes depending on actions. You press a button – left, right, forward – and the model calculates the next frame of the world. “It's more of a language model than a classic video model,” explains research lead Jack Parker-Holder. Classic video generators produce an entire video at once – Genie generates frame by frame, causally and interactively.
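The frame-by-frame principle can be illustrated with a minimal sketch. This is not Genie's implementation: the "frame" here is just a pair of coordinates, and `toy_next_frame` is a hypothetical stand-in for the learned model. It only shows the loop structure in which each new frame depends on the previous frames and the latest action.

```python
from typing import List, Tuple

# Toy "frame": only the viewer's (x, y) position. A real world model
# would produce an image here; this is an illustrative assumption.
Frame = Tuple[int, int]

# The three button inputs mentioned in the article, mapped to toy moves.
ACTIONS = {"left": (-1, 0), "right": (1, 0), "forward": (0, 1)}


def toy_next_frame(history: List[Frame], action: str) -> Frame:
    """Hypothetical stand-in for the learned model.

    It receives all frames generated so far plus the latest action and
    returns the next frame. This toy rule only looks at the last frame.
    """
    dx, dy = ACTIONS[action]
    x, y = history[-1]
    return (x + dx, y + dy)


def rollout(start: Frame, actions: List[str]) -> List[Frame]:
    """Generate frames one at a time, each conditioned on the past."""
    history = [start]
    for action in actions:
        # Causal step: only past frames and the current action are used.
        history.append(toy_next_frame(history, action))
    return history


if __name__ == "__main__":
    print(rollout((0, 0), ["forward", "forward", "left"]))
```

A classic video generator would correspond to producing the whole list in one call. In this loop, by contrast, the next action can be chosen after seeing the latest frame, which is what makes the process interactive.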

The result is not a video game in the classic sense, but a novel type of model: a kind of universal simulator that can create any imaginable world with a text prompt – from historical scenarios to disaster areas.

New in Genie 3 is the integration of Google Street View. Users can now choose real locations as a starting point; the model generates an interactive world from there. According to Diego Rivas, the impetus for this came from the users themselves: they repeatedly confronted the system with prompts like “take me to New York” or “show me my hometown.” Street View now provides the geographical anchor from which Genie continues to generate. For now, US locations are available, with global expansion planned.

Jack Parker-Holder

(Image: heise online / Malte Kirchner)

Genie 3 runs in real time, and does so with a model that simultaneously offers long-term memory, high output resolution, and broad generalization. Parker-Holder describes this as “technically very demanding”: a user's button input must travel over the network to a TPU cluster, be processed there, and return as a rendered frame, all with minimal latency.

There is still a significant gap to the real world: moving people, ambient sounds, 4K resolution – all of this is beyond current capabilities. “But we have pretty good ideas for the next few steps,” says Parker-Holder.

What distinguishes Genie from other AI projects is that the same model base drives very different applications. Waymo uses it to simulate rare traffic scenarios, such as an elephant on the road or a tornado. Another application is training complex robots: instead of attempting a task millions of times and failing, a robot can learn to perform it correctly more quickly in simulation.

Diego Rivas

(Image: heise online / Malte Kirchner)

In the long term, the team sees world models as an indispensable foundation for embodied AI. Robots have to operate in the real world, so they need a realistic simulation to train in.

Robotics teams are currently still working on the so-called “control problem”: can a robot reliably grasp any object or walk on any surface? Only once that is solved will the next challenge come to the fore: social intelligence, meaning the understanding of human behavior in unpredictable situations. This is precisely where Parker-Holder sees the greatest potential of world models.

The team's assessment of the market is sober: “Compared to LLMs, we are in 2021.” Many players are building very different things under the label “world model,” so direct comparison is hardly possible. In the coming years, Parker-Holder expects consolidation and a few major players who will shape the market. At Google I/O, new language models were presented alongside Genie 3: Gemini 3.5 Flash and Gemini Omni Flash are expected to handle video generation and autonomous agent tasks in the future.

(mki)

This article was originally published in German. It was translated with technical assistance and editorially reviewed before publication.