Meta FAIR: Better eyes and better understanding of AI models

Meta's goal is Advanced Machine Intelligence (AMI). New computer vision systems and a collaborative reasoning framework are meant to bring it closer.

Meta's FAIR team has published several research results and advances in the field of AI. The aim is to give scientists easy access to them and to build an open AI ecosystem focused on progress and discovery. Specifically, the releases comprise the Meta Perception Encoder, a Collaborative Reasoner framework and models for 3D object recognition.

The Perception Encoder is a large-scale vision encoder. As Meta describes it in a blog post, it acts as the "eyes" of an AI system, enabling it to process visual information. The new Perception Encoder is particularly strong at classifying images and videos and can handle difficult cases, such as recognizing a stingray that has buried itself in the seabed. Its capabilities also transfer to downstream language processing: with the encoder in place, an AI system becomes notably better at answering questions about an image. Meta has published the model, the code, the data set and a paper on the Perception Encoder.
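
Meta's repository ships its own loading code, so the sketch below uses a standard CLIP model from Hugging Face as a stand-in to illustrate the general pattern the Perception Encoder broadly follows: a dual encoder embeds the image and a set of candidate labels into a shared space, and the best-matching label wins. Model name, labels and file path are illustrative, not Meta's actual API.

```python
# Zero-shot image classification with a CLIP-style dual encoder, the same
# general pattern the Perception Encoder follows. The OpenAI CLIP checkpoint
# is a stand-in; the real Perception Encoder is loaded via Meta's own code.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("seabed.jpg")  # placeholder path, e.g. Meta's stingray example
labels = ["a stingray buried in the seabed", "an empty sandy seabed", "a flatfish"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Image-text similarity scores, normalized into probabilities over the labels.
probs = outputs.logits_per_image.softmax(dim=-1)
for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.2%}")
```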

The Collaborative Reasoner is based on the idea that people achieve better results when they work on a problem together. The catch is that this requires social skills. The framework is intended to help improve these collaborative skills in language models. It comprises a series of tasks that require two agents to solve together. Because current models apparently do not yet collaborate well with each other, an LLM agent is instead made to work with itself, i.e. to take on both roles. The code is available on GitHub.
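
What such self-collaboration can look like is sketched below under stated assumptions: a single LLM alternates between two named roles until the dialogue converges on an answer. The prompts, role names and stopping convention are illustrative, not Meta's actual implementation, which lives in the GitHub repository.

```python
# A minimal self-collaboration loop in the spirit of the Collaborative
# Reasoner: one and the same LLM plays both conversation partners.
# `llm` is a stand-in for any text-generation call; the prompts and the
# AGREED convention are illustrative, not Meta's implementation.
from typing import Callable

def self_collaborate(question: str, llm: Callable[[str], str], max_turns: int = 6) -> str:
    roles = ("Agent A", "Agent B")
    transcript = f"Question: {question}\n"
    for turn in range(max_turns):
        role = roles[turn % 2]  # the single model alternates between both roles
        prompt = (
            f"{transcript}\nYou are {role}. Discuss the question with your "
            f"partner, challenge weak arguments, and write AGREED: <answer> "
            f"once you both converge.\n{role}:"
        )
        reply = llm(prompt)
        transcript += f"{role}: {reply}\n"
        if "AGREED:" in reply:  # both roles have converged on an answer
            return reply.split("AGREED:", 1)[1].strip()
    return transcript  # no agreement within the turn budget

# Usage with any backend, e.g. a Hugging Face text-generation pipeline:
#   from transformers import pipeline
#   gen = pipeline("text-generation", model="gpt2")
#   llm = lambda p: gen(p, max_new_tokens=60)[0]["generated_text"][len(p):]
#   print(self_collaborate("Is 91 a prime number?", llm))
```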

Meta has also published a Perception Language Model (PLM), a model for visual language and visual recognition tasks. The PLM was trained on both synthetic, i.e. AI-generated, data and open data sets. Meta's FAIR team then determined which data was missing for video understanding and filled these gaps with 2.5 million new human-labeled videos. Who those annotators were is not disclosed. The result is the largest data set of its kind, Meta writes.

The paper emphasizes that no model distillation was used. Distillation would have meant that the data used to train the large teacher model remained unknown; instead, Meta states that the entire data package is freely available. The PLM comes in variants with 1, 3 and 8 billion parameters, making it well suited for "fully transparent academic research". The release is accompanied by a new benchmark that Meta is also making available: PLM-VideoBench.
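
The task pattern PLM targets, answering questions about an image or a video frame, looks roughly like the sketch below. Since PLM's own loading code ships with Meta's release, a small off-the-shelf VQA model (BLIP) stands in here; the file path and question are illustrative.

```python
# Visual question answering, the kind of task PLM is built for. BLIP serves
# as a readily available stand-in; swap in PLM via Meta's release.
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering

processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

image = Image.open("frame.jpg")  # placeholder: a single frame from a video
question = "What is the person in the video holding?"

inputs = processor(image, question, return_tensors="pt")
answer_ids = model.generate(**inputs)
print(processor.decode(answer_ids[0], skip_special_tokens=True))
```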

Meta Locate 3D is a model that localizes objects in 3D based on natural-language queries. As an example, Meta describes asking a robot to bring you a red cup from the table. The robot, or the model behind it, must understand what a red cup and a table are, and then work through a sequence of steps to grab and deliver the cup. "For AI systems to support us effectively in the physical world, they need to have a 3D understanding of the world based on natural language," Meta writes.

To recognize objects, the model builds a point cloud from sensor data. Among other things, it builds on Meta's I-JEPA, a model that learns abstract representations of what it sees. To pick out the right object, contextual information is added to the query, such as "the vase near the TV", so that the robot does not grab the vase from the windowsill instead. Here, too, Meta FAIR publishes the paper, data and model on GitHub.
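
The grounding step can be pictured as a nearest-neighbor lookup in a shared embedding space, as in the illustrative sketch below. All embeddings here are random toy data; the real model derives them from JEPA-style point-cloud features and a trained grounding head.

```python
# Illustrative open-vocabulary 3D grounding in the spirit of Locate 3D:
# candidate objects from the point cloud and the text query are embedded
# into a shared space, and the best-matching object's 3D center is returned.
# The embeddings below are random toy data, not real model features.
import torch
import torch.nn.functional as F

def locate(query_emb: torch.Tensor,
           object_embs: torch.Tensor,
           object_centers: torch.Tensor) -> torch.Tensor:
    """Return the 3D center of the object most similar to the query."""
    sims = F.cosine_similarity(object_embs, query_emb.unsqueeze(0), dim=-1)
    return object_centers[sims.argmax()]

torch.manual_seed(0)
object_embs = torch.randn(3, 512)  # e.g. vase near TV, vase on sill, red cup
object_centers = torch.tensor([[1.0, 0.2, 0.5],
                               [3.1, 0.1, 0.9],
                               [0.4, 0.8, 0.7]])
query_emb = torch.randn(512)       # embedding of "the vase near the TV"
print(locate(query_emb, object_embs, object_centers))
```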

Meta is also testing interactions between humans and robots directly, for example with Spot from Boston Dynamics. One of the robots is already roaming the Meta FAIR office, carrying cuddly toys from one place to another. For now, the robot dog is instructed via a Quest headset. The associated framework is called PARTNR.

Meta FAIR works in a strongly science-oriented way and generally releases its results as open source, although the scope and licenses vary. Its stated aim is Advanced Machine Intelligence (AMI) that can assist people with everyday tasks.

(emw)

This article was originally published in German. It was translated with technical assistance and editorially reviewed before publication.