Open Source
Sharing new research, models, and datasets from Meta FAIR
December 12, 2024
11 minute read
 

Takeaways

  • Today, Meta FAIR is releasing several new research artifacts that highlight our recent innovations in developing agents, robustness and safety, and architectures that facilitate machine learning.
  • The work we’re sharing advances our goal of achieving advanced machine intelligence and includes Meta Motivo, a foundation model for controlling the behavior of virtual embodied agents, and Meta Video Seal, an open source model for video watermarking.
  • We aim to democratize access to state-of-the-art technologies that transform our interaction with the physical world, which is why we're committed to fostering a collaborative and open ecosystem that accelerates progress and discovery.

As we continue to work towards our goal of achieving advanced machine intelligence, we want to share our progress with the research community so they can build upon our work. Today, we’re excited to release some of the latest research, code, models, and datasets from Meta Fundamental AI Research (FAIR). The artifacts we’re sharing today focus on building more capable agents, robustness and safety, and architecture innovations that enable models to learn new information more effectively and scale beyond current limits.

In this release, we’re sharing a demo and code for Meta Video Seal, an open source model for video watermarking that builds on the popular Meta Audio Seal work we shared last year. We’re also sharing a variety of other artifacts, including a foundation model for controlling the behavior of virtual embodied agents, a method for scaling memory layers that enables models to store more factual information, and code to help models become more socially intelligent. There’s plenty more to explore in this post, with nine total projects and artifacts ready for people to download and start using today.

This work supports our long and proven track record of sharing open, reproducible science with the community. By publicly sharing our early research work, we hope to inspire iterations and ultimately help advance AI in a responsible way. As always, we look forward to seeing what the community will build using these new releases and to continuing the dialogue about how we can all advance AI responsibly and build for the greater good.

Meta Motivo

Unsupervised reinforcement learning involves pre-training models to solve a wide range of downstream tasks in complex environments. Most methods require highly curated interaction datasets and often rely on unsupervised losses that lead to policies that may not align well with target tasks. Today, we’re sharing Meta Motivo, a first-of-its-kind behavioral foundation model that controls the movements of a virtual embodied humanoid agent to perform complex tasks.

Meta Motivo is trained with a novel algorithm that leverages an unlabeled dataset of motions to ground unsupervised reinforcement learning towards learning human-like behaviors while retaining zero-shot inference capabilities. The key technical novelty of our algorithm is to learn a representation that can be used to embed states, motions, and rewards into the same latent space. As a result, Meta Motivo is able to solve a wide range of whole-body control tasks, including motion tracking, goal pose reaching, and reward optimization, without any additional training or planning.
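
To make the shared latent space idea concrete, the toy sketch below embeds a reward function into the same space as states and then scores candidate actions against that latent prompt. It is an illustration of the general idea only, not Meta Motivo’s actual model, training algorithm, or released API; all networks, dimensions, and the toy reward are placeholders.

```python
import torch
import torch.nn as nn

# Toy sketch (not the released code) of the shared-latent-space idea behind
# behavioral foundation models: states and rewards are embedded into one space,
# so a new reward can be turned into a policy prompt without any retraining.

STATE_DIM, ACTION_DIM, LATENT_DIM = 16, 4, 32

backward_net = nn.Linear(STATE_DIM, LATENT_DIM)  # B(s): state -> latent
forward_net = nn.Linear(STATE_DIM + ACTION_DIM + LATENT_DIM, LATENT_DIM)  # F(s, a, z)

def reward_to_latent(states: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """Embed a reward function as a reward-weighted average of state embeddings
    computed over an unlabeled dataset of states."""
    z = (backward_net(states) * rewards.unsqueeze(-1)).mean(dim=0)
    return z / z.norm()

def q_value(state: torch.Tensor, action: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
    """Score an action for the prompted task: Q(s, a; z) = F(s, a, z) . z"""
    return forward_net(torch.cat([state, action, z])) @ z

# Zero-shot "prompting": relabel the same unlabeled states with a new reward,
# get a new latent, and the same networks immediately define a new behavior.
states = torch.randn(1024, STATE_DIM)
rewards = states[:, 0]                       # toy stand-in reward, e.g. "move forward"
z_task = reward_to_latent(states, rewards)

state = torch.randn(STATE_DIM)
candidate_actions = torch.randn(8, ACTION_DIM)
scores = torch.stack([q_value(state, a, z_task) for a in candidate_actions])
best_action = candidate_actions[scores.argmax()]
```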


Meta Motivo achieves competitive performance compared to task-specific methods and outperforms state-of-the-art unsupervised reinforcement learning and model-based baselines, while exhibiting more human-like behaviors. The model also displays a surprising level of robustness to changes in the environment, such as gravity, wind, or direct perturbations, despite not being trained for them.

In the future, we believe this research could pave the way for fully embodied agents in the Metaverse, leading to more lifelike NPCs, democratization of character animation, and new types of immersive experiences.

Read the paper

Try the demo

Download the code and model

Meta Video Seal

While AI tools can help bring the world closer together, it’s important that we implement safeguards to mitigate the risks of imitation, manipulation, and other forms of misuse that can undermine their benefits. Post-hoc watermarking is a crucial step towards better traceability for content and AI models.


Today, we’re releasing Meta Video Seal, a state-of-the-art, comprehensive framework for neural video watermarking. Video Seal adds a watermark (with an optional hidden message) into videos that is imperceptible to the naked eye and can later be uncovered to determine a video’s origin. The watermark has proven resilient against common video editing operations like blurring or cropping, as well as the compression algorithms commonly used when sharing content online. We’re publicly releasing the Video Seal model under a permissive license, along with a research paper, training code, and inference code. A demo is also available to try the model out interactively.
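
To illustrate the general structure behind neural video watermarking, the toy sketch below pairs an embedder that adds a message-conditioned, imperceptible residual with an extractor that recovers the hidden bits. It is not the Video Seal architecture or API; the networks, message length, and residual scaling are placeholders, and the real system is trained through editing and compression augmentations to gain its robustness.

```python
import torch
import torch.nn as nn

# Toy embedder/extractor structure for neural video watermarking (illustrative
# only; the released Video Seal code defines its own architectures and APIs).

MSG_BITS = 16

class Embedder(nn.Module):
    """Predicts an imperceptible residual conditioned on a binary message."""
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(3 + MSG_BITS, 3, kernel_size=3, padding=1)

    def forward(self, frames, message):
        # frames: (T, 3, H, W); message: (MSG_BITS,) in {0, 1}
        t, _, h, w = frames.shape
        msg_map = message.view(1, MSG_BITS, 1, 1).expand(t, MSG_BITS, h, w)
        residual = self.net(torch.cat([frames, msg_map], dim=1))
        return frames + 0.01 * residual   # small residual keeps the mark invisible

class Extractor(nn.Module):
    """Recovers the hidden bits from (possibly edited) frames."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(3, MSG_BITS, 3, padding=1),
                                 nn.AdaptiveAvgPool2d(1), nn.Flatten())

    def forward(self, frames):
        return self.net(frames).mean(dim=0)   # average bit logits over frames

frames = torch.rand(8, 3, 64, 64)             # a short clip
message = torch.randint(0, 2, (MSG_BITS,)).float()

watermarked = Embedder()(frames, message)
decoded_bits = (Extractor()(watermarked) > 0).float()
# During training, editing/compression augmentations are applied between
# embedding and extraction so the recovered bits stay reliable after edits.
```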

Along with Video Seal, we’re also releasing Meta Omni Seal Bench, a leaderboard dedicated to neural watermarking covering several modalities, enabling the research community to easily test and add their own work in the field. We’re also re-releasing our Meta Watermark Anything model under a permissive license and will organize a workshop on watermarking at ICLR in 2025.

This research is a testament to our commitment to responsible AI. We hope that other researchers and developers will join our efforts by integrating watermarking capabilities when building generative AI models. Watermark Anything, Video Seal, and Audio Seal (our previous work on post-hoc audio watermarking) are now all available for download and ready to be integrated.

Read the paper

Try the demo

Download the Video Seal code and model

Download the Watermark Anything code and model

View the Omni Seal Bench leaderboard

Flow Matching guide and codebase, a Meta FAIR release

Flow Matching is a state-of-the-art generative paradigm for many modalities including generation of images, videos, audio, music, 3D structures like proteins, and more. Our method has already replaced classical diffusion in many generative applications at Meta, including Meta Movie Gen, Meta Audiobox, and Meta Melody Flow, and across the industry in works such as Stable-Diffusion-3, Flux, Fold-Flow, and Physical Intelligence Pi_0. Flow Matching provides a simple yet flexible generative AI framework, improving performance and efficiency while allowing easy generalization to complex data. Today, we’re sharing a paper and code, including core implementations of both continuous and discrete Flow Matching, alongside state-of-the-art training scripts to enable the research community to easily use and iterate on the Flow Matching method. By publicly sharing this work, we hope to inspire wider adoption of Flow Matching and enable people to use it in their own generative projects.
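
As a rough illustration of the core objective, the sketch below implements a single conditional Flow Matching training step with a simple linear noise-to-data path; it is a minimal toy rather than the released codebase, and the network, data, and hyperparameters are placeholders.

```python
import torch
import torch.nn as nn

# Minimal conditional Flow Matching training step with a linear interpolation
# path (a toy sketch; the released codebase covers continuous and discrete
# variants plus full training scripts).

DIM = 2
velocity_net = nn.Sequential(nn.Linear(DIM + 1, 128), nn.SiLU(), nn.Linear(128, DIM))
optimizer = torch.optim.Adam(velocity_net.parameters(), lr=1e-3)

def flow_matching_step(x1: torch.Tensor) -> torch.Tensor:
    """One training step: regress the velocity field along noise->data paths."""
    x0 = torch.randn_like(x1)                  # noise sample
    t = torch.rand(x1.shape[0], 1)             # random time in [0, 1]
    xt = (1 - t) * x0 + t * x1                 # point on the straight path
    target_velocity = x1 - x0                  # d/dt of the path
    pred = velocity_net(torch.cat([xt, t], dim=-1))
    loss = ((pred - target_velocity) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss

# Sampling then integrates dx/dt = v(x, t) from t=0 (noise) to t=1 (data),
# e.g. with a simple Euler solver over a few dozen steps.
data_batch = torch.randn(256, DIM) * 0.1 + 2.0   # stand-in for real data
loss = flow_matching_step(data_batch)
```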

Read the paper

Download the code

Meta Explore Theory-of-Mind

A key aspect of social intelligence is the ability to reason about the thoughts and beliefs of other agents, both human and artificial. Existing Theory-of-Mind (ToM) datasets have limitations, focusing solely on evaluation and depicting only a narrow range of interactions. To address this and move closer to achieving advanced machine intelligence, we introduce Meta Explore Theory-of-Mind, a program-guided adversarial approach to generating theory-of-mind reasoning data. This framework enables the generation of diverse, challenging, and scalable ToM reasoning data for both training and evaluation, which will help accelerate progress in this critical area of research.

Explore Theory-of-Mind generates robust and reliable stories that push the limits of large language models (LLMs), making it well suited both for evaluating frontier models and for producing fine-tuning data that yields significant improvements on classic theory-of-mind benchmarks. Our first-of-its-kind approach led to a 27-point accuracy improvement on the commonly used ToMi benchmark when fine-tuning a Llama-3.1 8B model, underscoring the value of the generated training data. Explore Theory-of-Mind can be used to generate datasets for improving LLMs, enhance goal-oriented scenarios, and collect interaction datasets, while also serving as a benchmark for evaluating LLM performance.
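
To give a flavor of what program-guided generation means, the toy sketch below tracks the world state and each agent’s beliefs with a small program and emits a classic false-belief story, question, and gold answer. The released framework is far richer and adversarially searches for challenging scenarios; the agents, objects, and templates here are invented purely for illustration.

```python
import random

# Toy illustration of program-guided theory-of-mind data generation: a program
# tracks the true world state and each agent's beliefs, then emits a story and
# question whose gold answer follows from the tracked beliefs.

AGENTS = ["Anna", "Omar"]
CONTAINERS = ["basket", "box", "drawer"]

def generate_false_belief_item(rng: random.Random) -> dict:
    observer, mover = rng.sample(AGENTS, 2)
    start, end = rng.sample(CONTAINERS, 2)
    story = [
        f"{observer} puts the ball in the {start}.",
        f"{observer} leaves the room.",            # observer no longer sees moves
        f"{mover} moves the ball to the {end}.",
        f"{observer} returns.",
    ]
    beliefs = {observer: start, mover: end}        # program-tracked belief state
    question = f"Where will {observer} look for the ball?"
    return {"story": " ".join(story), "question": question,
            "answer": beliefs[observer]}

rng = random.Random(0)
dataset = [generate_false_belief_item(rng) for _ in range(3)]
for item in dataset:
    print(item["story"], item["question"], "->", item["answer"])
```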

Read the paper

Download the code

Download the dataset

Meta Large Concept Models

As we work toward advanced machine intelligence, models will need to be able to reason across languages and modalities and to excel at long-form generation tasks that require explicit hierarchical thinking, such as writing an essay. Current mainstream language modeling approaches typically operate at the token level and don’t explicitly reason in a hierarchical manner.

Today, we’re introducing a fundamentally different training paradigm for language modeling: the Large Concept Model (LCM). The core idea of the LCM is to decouple reasoning from language representation, and it’s inspired by how humans can plan high-level thoughts to communicate. For example, when giving a presentation multiple times, a presenter always has the same series of ideas they want to convey (materialized by their slides projected on screen), but their exact choice of words might vary from one run to the other.

Guided by that principle, the LCM is a significant departure from a typical LLM. Rather than predicting the next token, the LCM is trained to predict the next concept or high-level idea, represented by a full sentence in a multimodal and multilingual embedding space. Our work explores how predictions can be made for text in such a continuous space. Overall, the LCM outperforms or matches recent LLMs in the pure generative task of summarization, offers strong zero-shot generalization to unseen languages, and is more computationally efficient as input context grows. We hope the research community uses this work to improve language models that can operate on any modality or language, in an explicit hierarchical manner.
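
To make the next-concept objective concrete, the toy sketch below trains a small causal transformer to regress the embedding of the next sentence in a continuous space. The sentence encoder is replaced by random stand-in embeddings rather than the multilingual encoder used in the paper, and the paper also studies alternative prediction variants, so treat this purely as an illustration of the objective.

```python
import torch
import torch.nn as nn

# Toy sketch of the "next concept" objective: operate on a sequence of sentence
# embeddings rather than tokens, and regress the embedding of the next sentence.

EMB_DIM, N_SENTENCES = 256, 12

class ConceptModel(nn.Module):
    def __init__(self):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=EMB_DIM, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(EMB_DIM, EMB_DIM)

    def forward(self, sentence_embs):                     # (B, T, EMB_DIM)
        t = sentence_embs.shape[1]
        causal_mask = nn.Transformer.generate_square_subsequent_mask(t)
        hidden = self.backbone(sentence_embs, mask=causal_mask)
        return self.head(hidden)                          # predicted next embeddings

model = ConceptModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# Stand-in for a document encoded sentence-by-sentence into a shared space.
doc_embeddings = torch.randn(8, N_SENTENCES, EMB_DIM)

pred = model(doc_embeddings[:, :-1])                      # predict sentence i+1 from 1..i
loss = ((pred - doc_embeddings[:, 1:]) ** 2).mean()       # regression in concept space
loss.backward()
optimizer.step()
```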

Read the paper

Download the code

Meta Dynamic Byte Latent Transformer

Language models typically assume text has been tokenized in a heuristic preprocessing step that breaks words into smaller units that are easier to process. This limits end-to-end learning, is difficult to optimize in practice, and can hurt performance on rare text sequences. To address this, we’re introducing Dynamic Byte Latent Transformer, a hierarchical byte-level (tokenizer-free) model that uses a dynamic patching scheme to operate directly over raw bytes, without any tokenization heuristics, while also improving efficiency for long sequences during training and inference.
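
As a rough sketch of the dynamic patching idea, the toy code below opens a new patch wherever a per-byte entropy estimate spikes. In the actual model a small trained byte-level language model provides those entropy estimates; the frequency-based stand-in and threshold used here are placeholders.

```python
import torch

# Toy sketch of entropy-based dynamic patching: hard-to-predict bytes open new
# patches, so more compute is spent where the data is least predictable.

def next_byte_entropy(byte_values: torch.Tensor) -> torch.Tensor:
    """Stand-in for a small byte LM: treat rare bytes as high-entropy."""
    probs = torch.bincount(byte_values, minlength=256).float()
    probs = probs / probs.sum()
    return -torch.log(probs[byte_values] + 1e-9)          # surprisal per position

def dynamic_patches(data: bytes, threshold: float = 3.0) -> list[bytes]:
    byte_values = torch.tensor(list(data), dtype=torch.long)
    entropy = next_byte_entropy(byte_values)
    patches, start = [], 0
    for i in range(1, len(data)):
        if entropy[i] > threshold:                        # hard byte -> new patch
            patches.append(data[start:i])
            start = i
    patches.append(data[start:])
    return patches

text = "the cat sat on the mat. zxqj!".encode()
for patch in dynamic_patches(text):
    print(patch)
```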

Dynamic Byte Latent Transformer outperforms tokenizer-based models across the board in terms of robustness, with a seven-point advantage on average, and excels at processing long-tail and rare sequences of unseen symbols. By sharing this work, we hope to accelerate advancements that will enable us to better reason over a variety of domains that are important to advanced machine intelligence, including low-resource languages, coding, and factuality.

Read the paper

Download the code

Meta Memory Layers

Parametric memory, the repository of factual information stored in the weights of a neural network during pretraining, enables LLMs to understand complex concepts and linguistic nuances. As current scaling methods approach their efficiency limits, we must explore new architectures that enable models to learn information more effectively. Today, we’re sharing a research paper and code for Meta Memory Layers at Scale, a method for scaling memory layers that improves factuality on commonly used benchmarks as we work toward achieving advanced machine intelligence.

Memory Layers use a trainable key-value lookup mechanism to add extra parameters to a model without increasing FLOPs. Sparsely activated memory layers complement the compute-heavy nature of dense feed-forward layers, providing dedicated capacity to store and retrieve information cheaply. On downstream tasks, language models augmented with our improved memory layer outperform dense models with more than twice the computation budget, as well as MoE models when matched for both compute and parameters.
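
For intuition, here is a minimal sketch of a trainable key-value lookup with sparse top-k retrieval. It is an illustrative toy, not the released implementation, which uses product-key lookups and other optimizations so that key scoring does not require touching every slot the way the naive scoring below does; all sizes are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal trainable key-value memory layer with sparse top-k retrieval
# (illustrative only; not the released, product-key-based implementation).

class MemoryLayer(nn.Module):
    def __init__(self, dim: int = 512, num_slots: int = 65536, top_k: int = 32):
        super().__init__()
        self.query_proj = nn.Linear(dim, dim)
        self.keys = nn.Parameter(torch.randn(num_slots, dim) * 0.02)
        self.values = nn.Embedding(num_slots, dim)   # the extra parametric memory
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (B, T, dim)
        q = self.query_proj(x)
        scores = q @ self.keys.t()                         # (B, T, num_slots)
        top_scores, top_idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(top_scores, dim=-1)            # only k slots are active
        retrieved = self.values(top_idx)                   # (B, T, k, dim)
        return x + (weights.unsqueeze(-1) * retrieved).sum(dim=-2)

layer = MemoryLayer()
hidden = torch.randn(2, 16, 512)
out = layer(hidden)   # extra capacity lives in `values`; only top-k slots shape the output
```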

Contrary to the prevailing perception in the field that sparse memory architectures cannot be scaled competitively, we demonstrate efficient scaling of sparse memory layers up to 128 billion memory parameters on top of base models of up to 8B parameters, with significant improvements at comparable compute across the board on commonly used factuality benchmarks.

Read the paper

Download the code

Meta Image Diversity Modeling

This year, FAIR has focused on research to better understand and develop new methods for the safe development of image generation models. Today, we’re announcing updates on this research and releasing a comprehensive evaluation toolbox for text-to-image generative models. The image generation model we’ve developed through the course of this research builds on our prior research on generative models’ architectures and losses and prioritizes generating images that are representative of the physical world while maintaining competitive image quality with state-of-the-art models.

To further research into new methods and techniques for responsible development, we’re collaborating with external experts, whom we’re inviting to use our model to carry out research in areas that can help us improve safety and responsibility in image diversity modeling. This initiative highlights our commitment to collaborating with the wider AI research community to collectively advance AI responsibility.

Additionally, we will be open sourcing a comprehensive evaluation toolbox for text-to-image generative models to improve the ease and reproducibility of image generation benchmarking while promoting interpretable takeaways that inform future responsible text-to-image research.

Through our continued work, we hope to better understand and offer new methods for responsible development of image generative models that can be adopted by the broader research community.

Read the paper

Download the code

Meta CLIP 1.2

We’re excited to release Meta CLIP 1.2, a milestone in our ongoing efforts to develop a high-performance vision-language encoder. We have been working on advanced algorithms to effectively curate and align vast amounts of image-text data, unlocking the learning of human knowledge about the world. This enables our models to learn efficiently and accurately, capturing the nuances of fine-grained mapping between image and language semantics.

Large-scale, high-quality, and diverse datasets are essential for building foundation models that can learn about the world. Meta CLIP is our effort towards building such datasets and foundation models. To ensure a high-quality and safe vision-language encoder foundation model, we’ve developed algorithms to effectively curate and align data with human knowledge from vast data pools, enabling our models to learn efficiently and cover all possibilities. We also conducted rigorous data research while applying robust integrity and privacy-protective measures.

By releasing our data algorithms, training recipes, and foundation models trained on our curated dataset, we’re providing researchers and developers with the tools they need to advance the field of vision-language understanding. These foundation models can be used as vision encoders for multimodal LLMs, as multimodal embeddings for retrieval, and for zero-shot classification, while also serving as a starting point for research on data quality. Additionally, our algorithms and training methods can be used to create high-quality, large-scale, CLIP-like datasets from scratch, which can help with new research or production use cases.
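
For a sense of the zero-shot classification use case, the short sketch below uses the Hugging Face Transformers CLIP interface; the checkpoint name is a placeholder to be swapped for the released Meta CLIP 1.2 weights, and the gray test image stands in for a real photo.

```python
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

# Zero-shot classification with a CLIP-style encoder. The checkpoint name is a
# placeholder: substitute the released Meta CLIP 1.2 weights (or any compatible
# CLIP checkpoint).
MODEL_NAME = "facebook/metaclip-b32-400m"   # placeholder checkpoint name
model = CLIPModel.from_pretrained(MODEL_NAME)
processor = CLIPProcessor.from_pretrained(MODEL_NAME)

image = Image.new("RGB", (224, 224), color="gray")   # stand-in for a real photo
labels = ["a photo of a cat", "a photo of a dog", "a photo of a bicycle"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Image-text similarity scores become class probabilities over the label prompts.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```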

Read the paper

Download the dataset

Download the code

Download the model

