Meta publishes V-Jepa 2 – an AI world model

V-Jepa 2 is an AI world model that works differently from large language models – it learns on its own from video. Meta sees this as the future.

Yann LeCun from Meta's FAIR team in the video accompanying the blog post.

(Image: Screenshot of the blog post)


Like all major AI companies, Meta is working on AGI (Artificial General Intelligence). However, Meta's AI research team FAIR in Paris is also pursuing a different approach: AMI. Advanced Machine Intelligence is the goal of chief scientist and Turing Award winner Yann LeCun. Meta has now presented V-Jepa 2, a new world model intended as the next step towards AMI and useful AI agents. V-Jepa stands for Video Joint Embedding Predictive Architecture.

V-Jepa 2 is meant to acquire knowledge in a similar way to humans, which in turn should help AI models adapt to an unpredictable environment. Its predecessor V-Jepa, trained on video data, was presented last year. Building on this, V-Jepa 2 now has the “ability to predict actions and model the world”, according to Meta. The model should enable robots to handle unknown objects better and understand their surroundings more reliably.


V-Jepa 2 was trained with self-supervised learning on video data, so the training data does not need to be annotated by humans or elaborately prepared. Meta explains in the blog post that training takes place in two phases: first action-free pre-training, then action-conditioned fine-tuning. The model has 1.2 billion parameters and is publicly available. According to Meta, V-Jepa 2 already allows robots to perform zero-shot planning in previously unknown environments with objects that were not used during training – grasping an object, picking it up and placing it somewhere else poses no problem.
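Zero-shot planning with a world model of this kind is commonly framed as rolling out candidate action sequences in the model's representation space and executing the one whose predicted outcome lands closest to a goal image. The toy sketch below illustrates that receding-horizon idea; the stub class, the method names and the random-shooting planner are illustrative assumptions, not Meta's published code or API.

```python
import numpy as np

# Toy stand-in for a trained encoder plus action-conditioned predictor.
# All names and shapes here are assumptions made for this sketch.
class StubWorldModel:
    def __init__(self, obs_dim=64, emb_dim=32, action_dim=7, seed=0):
        rng = np.random.default_rng(seed)
        self.W_enc = rng.normal(scale=0.1, size=(obs_dim, emb_dim))            # fake "encoder"
        self.W_dyn = rng.normal(scale=0.1, size=(emb_dim + action_dim, emb_dim))  # fake dynamics

    def encode(self, frame):
        return np.tanh(frame @ self.W_enc)                  # observation -> embedding

    def predict_next(self, z, action):
        return np.tanh(np.concatenate([z, action]) @ self.W_dyn)  # next embedding

def plan_first_action(model, frame, goal_frame, horizon=5, n_candidates=256, action_dim=7):
    """Score candidate action sequences by how close their predicted rollout
    ends up to the goal embedding; return the first action of the best one."""
    z0, z_goal = model.encode(frame), model.encode(goal_frame)
    rng = np.random.default_rng(1)
    best_cost, best_action = np.inf, None
    for _ in range(n_candidates):
        actions = rng.uniform(-1.0, 1.0, size=(horizon, action_dim))
        z = z0
        for a in actions:
            z = model.predict_next(z, a)                    # roll out in latent space
        cost = np.linalg.norm(z - z_goal)                   # distance to goal embedding
        if cost < best_cost:
            best_cost, best_action = cost, actions[0]
    return best_action                                      # execute, observe, replan

model = StubWorldModel()
current, goal = np.random.rand(64), np.random.rand(64)
print(plan_first_action(model, current, goal))
```

The important point is that the robot needs no task-specific reward or retraining: a goal image and the learned predictor are enough to pick actions.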

In its Paris office, Meta is already working with a Spot robot from Boston Dynamics that can search for objects, pick them up and carry them somewhere else. It is instructed via a Quest headset: the wearer can see which steps the robot dog intends to take and can intervene if necessary.

Robot dog Spot is looking for a plush pineapple.

(Image: emw)

The special feature of V-Jepa 2 is its understanding of its environment and thus of the physical world. Yann LeCun has said several times that he considers the generative AI approach unsuitable for developing an AGI or AMI. It works for text because there is a finite number of symbols. “If your goal is to train a world model for recognition or planning, using pixel predictions is a terrible idea,” LeCun commented on OpenAI's video generator Sora.

According to LeCun, language can never fully represent the real world. We can imagine things without them having anything to do with language, he explains in a new video published with the blog post on V-Jepa 2. A world model should instead be more like a digital twin of the real world. People learn how the world works as toddlers – by observing it, and even before they can speak. Meta is trying to recreate this so that predictions can be made about what happens in the physical world, such as when a ball drops: it falls and does not suddenly fly upwards again.
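LeCun's objection to pixel prediction comes down to where the training loss is computed: a JEPA-style model predicts the embedding of missing video content rather than its pixels, so it does not have to account for every unpredictable detail. The toy sketch below contrasts the two losses; the tiny linear "encoders", the predictor and the EMA target encoder are illustrative assumptions, not V-Jepa 2's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
D_PIX, D_EMB = 256, 32

W_context = rng.normal(scale=0.1, size=(D_PIX, D_EMB))  # context encoder (trainable)
W_target = W_context.copy()                             # target encoder (EMA copy in practice)
W_pred = rng.normal(scale=0.1, size=(D_EMB, D_EMB))     # predictor in embedding space
W_decode = rng.normal(scale=0.1, size=(D_EMB, D_PIX))   # pixel decoder for the generative variant

def jepa_loss(visible, masked):
    """Predict the embedding of the masked content from the visible content."""
    z_ctx = np.tanh(visible @ W_context)
    z_tgt = np.tanh(masked @ W_target)     # target embedding (stop-gradient in practice)
    z_hat = z_ctx @ W_pred
    return np.mean((z_hat - z_tgt) ** 2)   # loss lives in representation space

def pixel_loss(visible, masked):
    """Generative alternative: reconstruct the masked content pixel by pixel."""
    recon = np.tanh(visible @ W_context) @ W_decode
    return np.mean((recon - masked) ** 2)  # penalises every unpredictable pixel detail

visible, masked = rng.normal(size=D_PIX), rng.normal(size=D_PIX)
print(jepa_loss(visible, masked), pixel_loss(visible, masked))
```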

The blog post makes it sound as if really helpful AI agents can only be developed with this kind of understanding. However, there are apparently also people at Meta who see things differently. Mark Zuckerberg himself is currently said to be assembling a team in San Francisco to work on generative AI and an AGI. Little is known about its focus, but media reports say that Zuckerberg is inviting potential candidates personally. It is also reported that Alexandr Wang and his company Scale AI are to be acquired by Meta. Scale AI primarily offers data sets prepared for AI training – exactly what is not needed to train Jepa.

Meta has also released two new benchmarks for testing models' understanding of physics: IntPhys 2 is designed to measure a model's ability to distinguish between physically plausible and implausible scenarios and builds on the earlier IntPhys benchmark.

Minimal Video Pairs (MVPBench) measures the physical comprehension capabilities of video language models using multiple-choice questions. “Unlike other video question answering benchmarks in the literature, MVPBench is designed to avoid common shortcuts observed in video-based language models, such as reliance on superficial visual or textual cues and biases.”
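Scoring on minimal video pairs is typically stricter than plain multiple-choice accuracy: a model only earns credit for a pair if it answers both nearly identical videos correctly, so guessing or leaning on surface cues no longer pays off. The snippet below sketches such a paired-accuracy metric; the record layout and field names are assumptions made for this example, not MVPBench's actual data format.

```python
# Illustrative paired-accuracy scoring for a minimal-pairs benchmark.
# The record structure (pair_id, prediction, answer) is an assumption.
from collections import defaultdict

def paired_accuracy(records):
    """A pair counts as correct only if both of its videos are answered correctly."""
    pairs = defaultdict(list)
    for r in records:
        pairs[r["pair_id"]].append(r["prediction"] == r["answer"])
    return sum(all(hits) for hits in pairs.values()) / len(pairs)

results = [
    {"pair_id": 0, "prediction": "B", "answer": "B"},
    {"pair_id": 0, "prediction": "A", "answer": "A"},  # both right -> pair scores
    {"pair_id": 1, "prediction": "C", "answer": "C"},
    {"pair_id": 1, "prediction": "C", "answer": "D"},  # one wrong -> pair fails
]
print(paired_accuracy(results))  # 0.5
```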

(emw)


This article was originally published in German. It was translated with technical assistance and editorially reviewed before publication.