
Advancing AI systems through progress in perception, localization, and reasoning

April 17, 2025

Takeaways


  • Meta FAIR is releasing several new research artifacts that advance our understanding of perception and support our goal of achieving advanced machine intelligence (AMI).
  • The work we’re sharing includes the Meta Perception Encoder, aimed at building more advanced computer vision systems that can assist people in everyday life with tasks such as image recognition and object detection. We’re also sharing advancements in 3D scene understanding and in localizing objects from natural language queries—all important developments on the path toward more sophisticated AI systems.
  • We’re also introducing Collaborative Reasoner, a framework to evaluate and improve the collaborative reasoning skills of large language models, which is an important step toward building collaborative social agents.
  • By making our research widely available, we aim to provide easy access for the research community and help foster an open ecosystem for AI that accelerates progress and discovery.

As we work toward our goal of advanced machine intelligence (AMI), it’s important to have models, benchmarks, and datasets that focus on perception. We need machines that can acquire, process, and interpret sensory information about the world around us and use that information to make decisions with human-like intelligence and speed. Today, we’re excited to publicly release five new works from our Meta Fundamental AI Research (FAIR) team that bring us closer to that goal.

Meta Perception Encoder: Setting new standards for language-aligned vision modeling

We’re excited to introduce Perception Encoder, a large-scale vision encoder that excels across several vision tasks for images and video. Vision encoders act as the “eyes” that enable AI systems to interpret visual information and better understand the world. As AI systems become more advanced, building a vision encoder that meets the expectations of advanced intelligence becomes even more challenging. To meet them, a vision encoder should connect vision and language, perform well on both images and videos, and be robust to challenging and potentially adversarial conditions. It should also recognize a broad range of concepts while remaining perceptive enough to distinguish subtle differences, such as between species of animals.

Perception Encoder demonstrates exceptional performance on image and video zero-shot classification and retrieval, surpassing all existing open source and proprietary models for such tasks. It also works particularly well on “hard” tasks, such as recognizing a stingray burrowed under the sea floor, identifying a tiny goldfinch in the background of an image, or catching a scampering agouti on a night vision wildlife camera.
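For readers who want a concrete picture, here is a minimal sketch of the standard zero-shot classification recipe used with dual image-text encoders of this kind. The `perception_models` import, `load_encoder` helper, and model name below are placeholders for illustration and may not match the released API; the core idea is simply comparing normalized image and text embeddings.

```python
import torch
import torch.nn.functional as F
from PIL import Image

# Hypothetical API: the released package may expose different names.
# This only illustrates the standard CLIP-style zero-shot recipe.
from perception_models import load_encoder  # assumed helper

image_encoder, text_encoder, preprocess = load_encoder("perception-encoder-large")

labels = [
    "a stingray buried in the sea floor",
    "a goldfinch in the background",
    "an agouti on a night-vision camera",
]
prompts = [f"a photo of {label}" for label in labels]

image = preprocess(Image.open("wildlife.jpg")).unsqueeze(0)  # (1, C, H, W)

with torch.no_grad():
    img_emb = F.normalize(image_encoder(image), dim=-1)   # (1, d)
    txt_emb = F.normalize(text_encoder(prompts), dim=-1)  # (num_labels, d)

# Zero-shot classification: pick the label whose text embedding is closest.
probs = (img_emb @ txt_emb.T).softmax(dim=-1)
print(dict(zip(labels, probs.squeeze(0).tolist())))
```

Zero-shot retrieval works the same way in reverse: embed a text query once and rank a gallery of image embeddings by the same cosine similarity.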

These strong perception abilities transfer to downstream language tasks. After aligning to a large language model, Perception Encoder surpasses all other vision encoders for image and video visual question answering, captioning, document understanding, and grounding. Perception Encoder also enables significant improvements on traditionally hard tasks for language models, such as telling if one object is behind another or if the camera is moving clockwise around an object.
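As a rough illustration of what “aligning to a large language model” typically involves (a common recipe, not necessarily the exact one used here), a small adapter projects the encoder’s patch features into the LLM’s embedding space so they can be fed to the language model as a prefix of visual tokens:

```python
import torch
import torch.nn as nn

class VisionToLLMAdapter(nn.Module):
    """Illustrative adapter: projects vision-encoder patch features into an
    LLM's embedding space so they can be consumed as a prefix of visual
    tokens. This is a common alignment recipe, not necessarily the exact
    one used with Perception Encoder."""

    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim)
        return self.proj(patch_features)  # (batch, num_patches, llm_dim)

# Usage: projected visual tokens are concatenated with the text embeddings
# before running the language model (dimensions below are arbitrary).
adapter = VisionToLLMAdapter(vision_dim=1024, llm_dim=4096)
visual_tokens = adapter(torch.randn(1, 256, 1024))
# llm_inputs = torch.cat([visual_tokens, text_token_embeddings], dim=1)
```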

As Perception Encoder begins to be integrated into new applications, we’re excited to see how its advanced vision capabilities will enable even more capable AI systems.

Download the model

Download the code

Download the dataset

Read the paper

Meta Perception Language Model: Enhancing our understanding of visual perception tasks

Continuing our work on perception, we’re releasing the Perception Language Model (PLM), an open and reproducible vision-language model to tackle challenging visual recognition tasks.

We trained PLM using synthetic data generated at scale and open vision-language understanding datasets, without any distillation from external models. We then identified key gaps in existing data for video understanding and collected 2.5 million new, human-labeled fine-grained video QA and spatio-temporal caption samples to fill these gaps, forming the largest dataset of its kind to date.
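To make the two data types concrete, the records below show how a fine-grained video QA sample and a spatio-temporal caption might be represented. The field names are our own illustrative assumptions, not the released PLM dataset schema:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Illustrative record layouts for the two data types described above.
# Field names are assumptions for exposition, not the released PLM schema.

@dataclass
class VideoQASample:
    video_id: str
    question: str      # e.g. "What does the person do after opening the jar?"
    answer: str        # fine-grained, activity-level answer
    start_sec: float   # segment of the video the question refers to
    end_sec: float

@dataclass
class SpatioTemporalCaption:
    video_id: str
    caption: str       # describes what happens, where, and when
    start_sec: float   # temporal grounding of the caption
    end_sec: float
    region: Optional[Tuple[float, float, float, float]] = None  # optional box (x1, y1, x2, y2)

sample = VideoQASample(
    video_id="vid_0001",
    question="What does the person pour into the bowl?",
    answer="flour",
    start_sec=12.0,
    end_sec=18.5,
)
```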

PLM is trained on this massive dataset, using a combination of human-labeled and synthetic data to create a robust, accurate, and fully reproducible model. PLM offers variants with 1, 3, and 8 billion parameters, making it well suited for fully transparent academic research.

We’re also sharing a new benchmark, PLM-VideoBench, which focuses on tasks that existing benchmarks miss: fine-grained activity understanding and spatiotemporally grounded reasoning. We hope that our open and large-scale dataset, challenging benchmark, and strong models together enable the open source community to build more capable computer vision systems.

Download the model

Download the code

Download the dataset

Read the paper

Meta Locate 3D: A new frontier in open-vocabulary object localization

Imagine saying, “Hey robot, bring me the red cup on the table,” and having a robot complete the task. For AI systems to effectively assist us in the physical world, it’s essential that they have a 3D world understanding grounded in natural language. To perform such tasks, a robot needs to first localize the object in the 3D environment, navigate to it, and pick it up.

To address this, we built Meta Locate 3D, an end-to-end model that can accurately localize objects from open-vocabulary queries. Meta Locate 3D operates directly on 3D point clouds received from a robot’s RGB-D sensors. Given a text prompt such as “flower vase near TV console,” it takes spatial relationships and context into account to identify the specific object instance (the vase near the TV, not the one on the table) and pinpoint its exact location.

Meta Locate 3D consists of three key components (a code sketch follows this list):

  • A pre-processing step that first lifts 2D foundation features to 3D featurized point clouds.
  • The 3D-JEPA encoder, a pre-trained encoder that takes the featurized point clouds as input and predicts a contextualized, smoothed representation of the 3D world.
  • The Locate 3D decoder, which takes the 3D-JEPA representation and a language query and produces both bounding boxes and masks for the specified objects.
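
Putting the three components together, a pipeline might look roughly like the sketch below. The `locate_3d` module and class names are hypothetical stand-ins used only to show how the pieces connect, not the released interface:

```python
import numpy as np

# Hypothetical interfaces -- the released Meta Locate 3D code may differ.
from locate_3d import lift_2d_features, JEPA3DEncoder, Locate3DDecoder  # assumed

def locate(rgbd_frames: list[np.ndarray], query: str):
    """End-to-end sketch of the three-stage Locate 3D pipeline."""
    # 1. Preprocessing: lift 2D foundation features into a featurized point cloud.
    point_cloud = lift_2d_features(rgbd_frames)            # (num_points, 3 + feat_dim)

    # 2. 3D-JEPA encoder: contextualized, smoothed representation of the scene.
    encoder = JEPA3DEncoder.from_pretrained("locate-3d")   # assumed checkpoint name
    scene_repr = encoder(point_cloud)

    # 3. Decoder: language query + scene representation -> boxes and masks.
    decoder = Locate3DDecoder.from_pretrained("locate-3d")
    boxes, masks = decoder(scene_repr, query)
    return boxes, masks

# boxes, masks = locate(rgbd_frames, "flower vase near TV console")
```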

We’re also releasing a new dataset for localization of objects based on referring expressions. This dataset includes 130,000 language annotations across three widely used datasets—ARKitScenes, ScanNet, and ScanNet++—and covers 1,346 scenes, effectively doubling the existing data annotations.

By enabling robots to accurately understand their surroundings and ground their understanding in natural language, Meta Locate 3D supports the development of more sophisticated and capable robotic systems, including Meta PARTNR. With Meta Locate 3D, humans can naturally interact with robots to request or collaborate on tasks, which marks an exciting step forward in the pursuit of more intelligent and autonomous machines.

Download the model

Try the demo

Download the dataset

Read the paper

Dynamic Byte Latent Transformer: Redefining efficiency and robustness standards

Following the publication of our research paper in late 2024, by popular demand, we’re releasing model weights for our 8B parameter Dynamic Byte Latent Transformer. This research marks a significant advancement in byte-level language model architectures, achieving performance at scale that matches traditional tokenization-based language models for the first time. This technology enhances inference efficiency and significantly improves robustness.

The Dynamic Byte Latent Transformer architecture outperforms tokenizer-based models across various tasks, with an average robustness advantage of +7 points on perturbed HellaSwag, and reaches as high as +55 points on tasks from the CUTE token-understanding benchmark. This highlights the potential of Dynamic Byte Latent Transformer to redefine the standards for language model efficiency and reliability, offering a compelling alternative to traditional tokenization methods.
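Part of the intuition behind this robustness is that a byte-level model’s input vocabulary is just the 256 possible byte values, so character-level perturbations change only the affected positions rather than re-segmenting words into different subword tokens. The snippet below illustrates this; it is a conceptual aid only, since the Dynamic Byte Latent Transformer additionally groups bytes into dynamically sized latent patches for efficiency:

```python
# A byte-level model's input "vocabulary" is just the 256 possible byte values,
# so any string -- including misspelled or adversarially perturbed text -- maps
# to a well-formed sequence with no out-of-vocabulary or re-tokenization effects.
# (Illustration only: the Dynamic Byte Latent Transformer additionally groups
# bytes into dynamically sized latent patches for efficiency.)

text = "The quick brown fox"
perturbed = "The qu1ck brwon fox"   # character-level perturbation

byte_ids = list(text.encode("utf-8"))
perturbed_ids = list(perturbed.encode("utf-8"))

print(byte_ids[:8])        # [84, 104, 101, 32, 113, 117, 105, 99]
print(perturbed_ids[:8])   # [84, 104, 101, 32, 113, 117, 49, 99]
# Only the perturbed positions differ; a subword tokenizer would instead
# re-segment the word into different, possibly rare tokens.
```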

With this new model, and our previously released codebase, we encourage the community to explore new ideas, hopefully paving the way for even more groundbreaking developments in the field of language modeling.

Download the model

Download the code

Read the paper

Collaborative Reasoner: Self-improving social agents with synthetic conversations

When humans collaborate, we often achieve stronger outcomes together. In the same spirit, our goal is to develop social AI agents that can collaborate with humans or other AI agents to accomplish tasks better than a single agent or human could alone. Imagine an agent that helps you understand a difficult homework assignment or prepare for a job interview. These collaborations are challenging because, in addition to problem-solving, they require social skills such as effective communication, constructive feedback, empathy, and theory of mind. Furthermore, this kind of collaboration typically unfolds over multiple turns of back-and-forth natural conversation. Current LLM evaluation benchmarks and training pipelines don’t account for these collaborative and social skills, and collaborative back-and-forth conversational data is expensive to gather, domain-specific, and hard to control, making both evaluation and training difficult.

To address these challenges, we built Collaborative Reasoner, a framework to evaluate and improve the collaborative reasoning skills of language models. Collaborative Reasoner includes a suite of goal-oriented tasks that require multi-step reasoning that needs to be accomplished collaboratively by two agents via a multi-turn conversation. The tasks and metrics in Collaborative Reasoner require agents to disagree on solutions, convince their partner of a correct solution, and ultimately agree on the best solution as a team.

Our evaluation shows that current models can’t consistently use collaboration to achieve better task performance. To improve the collaborative reasoning capabilities of LLMs, we propose a self-improvement approach that uses synthetic interaction data sampled via self-collaboration—in other words, an LLM agent collaborating with itself. To generate such data at scale, we also developed Matrix, a versatile, high-performance model serving engine for large-scale inference, multi-agent data generation, and experimentation. On math (MATH), scientific (MMLU-Pro, GPQA), and social reasoning (ExploreToM, HiToM) tasks, our approach yields improvements of up to 29.4% over the chain-of-thought performance of an equivalent single-agent LLM.
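The sketch below shows one way such self-collaboration data could be generated: a single model alternates between two personas in a multi-turn conversation until the partners converge on a final answer. The prompt format and `generate` callable are illustrative assumptions, not the Collaborative Reasoner or Matrix APIs:

```python
# Sketch of self-collaboration: one LLM plays both partners in a multi-turn
# conversation about a problem. Dialogues can then be filtered (for example,
# by whether the agreed answer matches ground truth) into synthetic training
# data. `generate` stands in for any chat-completion call; the prompt format
# is illustrative, not the actual Collaborative Reasoner recipe.

def self_collaborate(problem: str, generate, max_turns: int = 6) -> dict:
    conversation = []
    roles = ("Agent A", "Agent B")
    for turn in range(max_turns):
        speaker = roles[turn % 2]
        history = "\n".join(f"{r}: {m}" for r, m in conversation)
        prompt = (
            f"You are {speaker}, solving this problem together with a partner:\n"
            f"{problem}\n\n{history}\n{speaker}: Propose or critique a solution. "
            "If you both agree, end your message with FINAL ANSWER: <answer>."
        )
        message = generate(prompt)
        conversation.append((speaker, message))
        if "FINAL ANSWER:" in message:
            break

    last = conversation[-1][1]
    answer = last.split("FINAL ANSWER:")[-1].strip() if "FINAL ANSWER:" in last else None
    return {"problem": problem, "dialogue": conversation, "answer": answer}
```

Dialogues whose final answer matches the ground truth can then be kept as synthetic conversational training data for fine-tuning the same model.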

Collaborative Reasoner paves the way for developing social agents that can partner with humans and other agents. We’re open sourcing our data generation and modeling pipeline to support further research in this area.

Download the Collaborative Reasoner code

Download the MATRIX code

Read the paper
