Meta FAIR: Watermarks for videos and virtual agents with legs
Meta's AI research team has released a series of improvements for widely used AI models, all freely available.
(Image: everything possible/Shutterstock.com)
Meta Video Seal is a new type of watermark for AI videos; Meta Motivo is an AI model that controls agents in virtual spaces. Under the term flow matching, Meta groups methods that are intended to replace previous diffusion models. The company also gives an update on how it believes it can make AI smarter. The FAIR team (Fundamental AI Research) is presenting a number of new developments.
With Meta Video Seal, Meta wants to at least minimize the risks of AI misuse. Watermarks are a necessary step towards making content and AI models traceable. The new method is a "comprehensive framework for neural watermarks in videos": invisible to the eye, but robust against common video edits that could obscure the origin, such as cropping or the compression applied when content is uploaded to social media. The research paper, training code and inference code are freely available.
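The general idea behind such robustness-oriented watermarks can be sketched in a few lines. The following is a minimal spread-spectrum toy example, not Video Seal's actual neural method or API: each bit is embedded as a low-amplitude pseudorandom pattern in a frame and recovered by correlation, surviving a crude stand-in for compression noise.

```python
import numpy as np

def pattern(i: int, shape) -> np.ndarray:
    """Deterministic pseudorandom pattern for bit i (acts as a shared secret)."""
    return np.random.default_rng(seed=1000 + i).standard_normal(shape)

def embed(frame: np.ndarray, bits: list[int], strength: float = 2.0) -> np.ndarray:
    """Add one low-amplitude pattern per bit; the sign encodes the bit value."""
    marked = frame.astype(np.float64).copy()
    for i, bit in enumerate(bits):
        marked += strength * (1.0 if bit else -1.0) * pattern(i, frame.shape)
    return marked

def extract(received: np.ndarray, original: np.ndarray, n_bits: int) -> list[int]:
    """Non-blind detection: correlate the residual with each secret pattern."""
    residual = received.astype(np.float64) - original
    return [int((residual * pattern(i, received.shape)).sum() > 0)
            for i in range(n_bits)]

rng = np.random.default_rng(0)
frame = rng.integers(0, 256, size=(64, 64)).astype(np.float64)
bits = [1, 0, 1, 1, 0, 0, 1, 0]
marked = embed(frame, bits)
# Crude stand-in for lossy compression: additive noise on the marked frame.
attacked = marked + rng.normal(0.0, 5.0, size=frame.shape)
assert extract(attacked, frame, len(bits)) == bits
```

In Video Seal, embedding and extraction are handled by trained neural networks rather than fixed patterns; the toy example only illustrates why correlation-style detection tolerates moderate distortion.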
An audio counterpart, Audio Seal, already exists. Meta's watermarking tool Watermark Anything Model has also been released under a free license.
Meta Motivo for virtual agents
AI agents are expected to take over numerous tasks for humans in the future, and all major AI providers are working on this. Meta is now releasing an AI model for controlling virtual agents, i.e. agents designed to have a body. Meta Motivo was trained with a new type of algorithm on a dataset of movements. Human-like behaviors are then learned via reinforcement learning, in which the model is steered by rewards for correct behavior. What is new is the transfer between these signals. Meta writes in the blog post: "The key technical novelty of our algorithm is to learn a representation that can be used to embed states, motions, and rewards into the same latent space."
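A minimal PyTorch sketch can make the quoted idea concrete. All names and dimensions below are illustrative assumptions, not Meta Motivo's actual architecture: three encoders map a state, a motion, and a reward description into one latent space, and a single policy conditions on that shared latent.

```python
import torch
import torch.nn as nn

LATENT = 32  # assumed latent dimension, purely illustrative

class SharedLatentModel(nn.Module):
    """Sketch: embed states, motions, and rewards into one latent space."""
    def __init__(self, state_dim=64, motion_dim=128, reward_dim=16, act_dim=12):
        super().__init__()
        self.state_enc = nn.Linear(state_dim, LATENT)
        self.motion_enc = nn.Linear(motion_dim, LATENT)
        self.reward_enc = nn.Linear(reward_dim, LATENT)
        # One policy serves all tasks: it conditions on the shared latent z.
        self.policy = nn.Sequential(
            nn.Linear(state_dim + LATENT, 256), nn.ReLU(),
            nn.Linear(256, act_dim),
        )

    def act(self, state: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        return self.policy(torch.cat([state, z], dim=-1))

model = SharedLatentModel()
state = torch.randn(1, 64)
# Task prompts from different modalities land in the same space ...
z_motion = model.motion_enc(torch.randn(1, 128))  # "imitate this motion"
z_reward = model.reward_enc(torch.randn(1, 16))   # "maximize this reward"
# ... so the same policy can be steered by either of them.
action = model.act(state, z_motion)
print(action.shape)  # torch.Size([1, 12])
```

Because imitation prompts (motions) and task prompts (rewards) land in the same space, one pretrained policy can be steered towards either without retraining, which is the kind of transfer the blog post describes as new.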
(Image: Meta blog post)
Until recently, it was something of a running joke that avatars in the metaverse had no legs; now the model should make it easy to recreate realistic-looking movements from head to toe. According to Meta, the resulting behaviors are also remarkably robust under varying conditions such as wind and other disturbances. Beyond the metaverse, the company envisages the technology being used for NPCs, short for non-player characters, in video games.
Flow matching instead of diffusion
The first image generators were based on so-called diffusion models. These are increasingly being replaced or extended. Meta uses the term flow matching for a paradigm that can be used to generate different kinds of content. Meta Movie Gen, Meta Audiobox and Meta Melody Flow, among others, are said to have already been converted to the new technique. Stable Diffusion 3, Fold-Flow, Physical Intelligence's Pi_0 and the image generator Flux also appear to use flow matching. Flux.1 from Black Forest Labs was previously responsible for image generation in xAI's Grok, but was recently replaced by the in-house Aurora model.
Flow matching extends diffusion models with Continuous Normalizing Flows (CNFs), which effectively shortens the probability-based generation process. Meta has also published all the relevant material on the approach.
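The core training objective is compact enough to sketch. Below is a generic conditional flow matching loss with linear interpolation paths, as described in the flow matching literature; the toy network and data are illustrative, not Meta's released code.

```python
import torch
import torch.nn as nn

# Tiny velocity-field network v_theta(x_t, t); dimensions are illustrative.
DIM = 2
net = nn.Sequential(nn.Linear(DIM + 1, 64), nn.SiLU(), nn.Linear(64, DIM))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

def cfm_loss(x1: torch.Tensor) -> torch.Tensor:
    """Conditional flow matching with linear paths x_t = (1-t)*x0 + t*x1.

    The regression target is the constant velocity x1 - x0 along the path.
    """
    x0 = torch.randn_like(x1)                 # noise sample
    t = torch.rand(x1.shape[0], 1)            # uniform time in [0, 1]
    xt = (1 - t) * x0 + t * x1                # point on the path
    v_target = x1 - x0                        # true velocity along the path
    v_pred = net(torch.cat([xt, t], dim=-1))
    return ((v_pred - v_target) ** 2).mean()

for step in range(200):
    data = torch.randn(128, DIM) * 0.5 + 2.0  # toy "data" distribution
    loss = cfm_loss(data)
    opt.zero_grad(); loss.backward(); opt.step()

# Sampling: integrate dx/dt = v_theta(x, t) from noise with Euler steps.
x = torch.randn(5, DIM)
for i in range(50):
    t = torch.full((5, 1), i / 50)
    x = x + (1 / 50) * net(torch.cat([x, t], dim=-1))
print(x.mean(0))  # moves toward the data mean (~2.0) as training improves
```

Unlike a diffusion model, which is trained to denoise across many noise levels and typically sampled with many steps, the learned velocity field can be integrated with comparatively few ODE steps, which is where the shortened generation process comes from.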
Everything is likewise available for Meta Explore Theory-of-Mind, a new framework for theory-of-mind reasoning. "Our novel framework enables the generation of diverse, sophisticated and scalable ToM data for both training and evaluation, which will accelerate progress in this important area of research," writes Meta.
A Large Concept Model (LCM) is intended to decouple language ability from thinking. Meta illustrates this with a presenter who wants to convey the same content in every delivery of the same talk, even though the wording changes each time. Under this paradigm, the model no longer predicts the next token but the next idea or piece of content. As a result, such models should be able to summarize content much better and be more efficient overall.
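A minimal sketch of the concept-prediction idea follows; the sentence encoder and loss are illustrative stand-ins, not the actual LCM design. Instead of producing next-token logits, the model regresses the embedding of the next sentence.

```python
import torch
import torch.nn as nn

EMB = 64  # assumed size of a sentence ("concept") embedding

def encode(sentences: list[str]) -> torch.Tensor:
    """Stand-in sentence encoder: in Meta's setup this would be a pretrained
    multilingual embedder; here each sentence just gets a fixed pseudorandom
    vector (stable within a run), enough to show the training setup."""
    vecs = []
    for s in sentences:
        g = torch.Generator().manual_seed(hash(s) % (2**31))
        vecs.append(torch.randn(EMB, generator=g))
    return torch.stack(vecs)

class ConceptPredictor(nn.Module):
    """Autoregressive model over sentence embeddings instead of tokens."""
    def __init__(self):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=EMB, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(EMB, EMB)

    def forward(self, concepts: torch.Tensor) -> torch.Tensor:
        mask = nn.Transformer.generate_square_subsequent_mask(concepts.shape[1])
        h = self.backbone(concepts, mask=mask)
        return self.head(h)  # predicted embedding of the *next* concept

doc = ["The talk starts.", "It makes one main point.", "Then it concludes."]
x = encode(doc).unsqueeze(0)            # (1, seq, EMB)
model = ConceptPredictor()
pred = model(x[:, :-1])                 # predict concepts 2..n from 1..n-1
loss = ((pred - x[:, 1:]) ** 2).mean()  # regression in embedding space
print(loss.item())
```

The design choice matters for the claims in the text: because the prediction target is a whole idea rather than a word piece, the same trained model is indifferent to the exact wording, which is what should make summarization and cross-lingual reuse more efficient.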
To address the problem that large language models cannot handle individual letters or digits well, Meta wants to replace tokens with bytes: the Meta Dynamic Byte Latent Transformer should then also be able to spell and count.
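The difference is easy to see at the input level. The snippet below is a plain illustration using UTF-8 bytes versus a made-up subword split, not the model's actual tokenizer:

```python
# A subword tokenizer might split "strawberry" into opaque units such as
# ["str", "aw", "berry"] (hypothetical split), hiding the individual letters.
# A byte-level model instead sees every character as one or more UTF-8 bytes:
word = "strawberry"
byte_ids = list(word.encode("utf-8"))
print(byte_ids)  # [115, 116, 114, 97, 119, 98, 101, 114, 114, 121]

# Character-level questions become simple lookups over the byte sequence:
print(byte_ids.count(ord("r")))  # 3 -- the letter 'r' occurs three times
```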
(emw)