Sharing new research, models, and datasets from Meta FAIR
June 18, 2024

Takeaways:


  • Today, Meta FAIR is publicly releasing several new research artifacts. Our hope is that the research community can use them to innovate, explore, and discover new ways to apply AI at scale.
  • These lines of work build on our key principles of openness, collaboration, excellence, and scale.
  • We believe that access to state-of-the-art AI creates opportunities for everyone. That’s why we’re committed to the continued growth and development of an open AI ecosystem.

For more than a decade, Meta’s Fundamental AI Research (FAIR) team has focused on advancing the state of the art in AI through open research. As innovation in the field continues to move at a rapid pace, we believe that collaboration with the global AI community is more important than ever. Maintaining an open science approach and sharing our work with the community help us stay true to our goal of building AI systems that work well for everyone and bring the world closer together.

Today, we’re excited to share some of the most recent FAIR research models with the global community. We’re publicly releasing six research artifacts that focus on themes at the core of our work: innovation, creativity, efficiency, and responsibility. These releases include image-to-text and text-to-music generation models, a multi-token prediction model, and a technique for detecting AI-generated speech. By publicly sharing our early research work, we hope to inspire iterations and ultimately help advance AI in a responsible way. We can’t wait to see what the community builds with these latest releases and continue the important conversations we’re having with the open source community.


Meta Chameleon


As we shared in our research paper last month, Meta Chameleon is a family of models that can combine text and images as input and output any combination of text and images with a single unified architecture for both encoding and decoding. While most current late-fusion models use diffusion-based learning, Meta Chameleon uses tokenization for text and images. This enables a more unified approach and makes the model easier to design, maintain, and scale. The possibilities are endless—imagine generating creative captions for images or using a mix of text prompts and images to create an entirely new scene.
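To make the early-fusion idea concrete, here is a minimal sketch (not the released Chameleon code; the vocabulary sizes, tokenizers, and model dimensions are assumptions) of how text tokens and discrete image tokens can share a single vocabulary and a single autoregressive transformer:

```python
# Illustrative sketch of early fusion over discrete tokens (not the released
# Chameleon implementation; vocabulary sizes and tokenizers are assumptions).
import torch
import torch.nn as nn

TEXT_VOCAB = 65_536       # assumed BPE vocabulary size
IMAGE_VOCAB = 8_192       # assumed codebook size of a VQ image tokenizer
VOCAB = TEXT_VOCAB + IMAGE_VOCAB  # one shared vocabulary for both modalities

class MixedModalLM(nn.Module):
    def __init__(self, d_model=512, n_layers=8, n_heads=8):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, VOCAB)

    def forward(self, tokens):
        # tokens: (batch, seq) of mixed text and image token ids
        x = self.embed(tokens)
        seq = tokens.shape[1]
        causal = torch.triu(
            torch.full((seq, seq), float("-inf"), device=tokens.device), diagonal=1
        )
        h = self.trunk(x, mask=causal)
        return self.lm_head(h)  # next-token logits over the shared vocabulary

# A prompt interleaving text tokens with image tokens (image ids offset by TEXT_VOCAB)
text_ids = torch.randint(0, TEXT_VOCAB, (1, 16))
image_ids = torch.randint(0, IMAGE_VOCAB, (1, 32)) + TEXT_VOCAB
prompt = torch.cat([text_ids, image_ids], dim=1)
logits = MixedModalLM()(prompt)   # (1, 48, VOCAB)
```

Because both modalities live in one token space, the same decoder can, in principle, continue a sequence with either text or image tokens.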



Today, we’re publicly releasing key components of our Chameleon 7B and 34B models under a research-only license. The models we’re releasing today were safety tuned and support mixed-modal inputs and text-only output to be used for research purposes. While we’ve taken steps to develop these models responsibly, we recognize that risks remain. At this time, we are not releasing the Chameleon image generation model. With the existing models we’re sharing today, we hope to encourage the research community to design new detection and mitigation strategies that will help scale generative modeling research in a responsible way.



Get the models

Multi-Token Prediction


Most modern LLMs have a simple training objective: predicting the next word. While this approach is straightforward and scalable, it's also inefficient, requiring several orders of magnitude more text than children need to reach the same degree of language fluency.

In April, we proposed a new approach for building better and faster LLMs through multi-token prediction. With this approach, we train language models to predict multiple future words at once, rather than one at a time. This improves model capabilities and training efficiency while enabling faster inference. In the spirit of responsible open science, we're releasing the pre-trained models for code completion under a non-commercial, research-only license, so the research community can independently investigate our method and the trained models' behavior.
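As a rough illustration of the training objective (a minimal sketch with assumed sizes and a stand-in recurrent trunk, not the released models), several output heads share one trunk, and head k is trained to predict the token k+1 steps ahead:

```python
# Minimal sketch of multi-token prediction: n_future output heads share one
# trunk; head k is trained to predict the token k+1 steps ahead.
# (Not the released implementation; sizes and the GRU trunk are assumptions.)
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTokenLM(nn.Module):
    def __init__(self, vocab=32_000, d_model=512, n_future=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.trunk = nn.GRU(d_model, d_model, num_layers=2, batch_first=True)
        self.heads = nn.ModuleList([nn.Linear(d_model, vocab) for _ in range(n_future)])

    def forward(self, tokens):
        h, _ = self.trunk(self.embed(tokens))      # shared representation
        return [head(h) for head in self.heads]    # one logit tensor per offset

def multi_token_loss(model, tokens):
    # tokens: (batch, seq); head k predicts tokens shifted by k+1 positions.
    logits = model(tokens)
    loss = 0.0
    for k, lg in enumerate(logits, start=1):
        pred = lg[:, :-k]                          # positions with a target k steps ahead
        target = tokens[:, k:]
        loss = loss + F.cross_entropy(pred.reshape(-1, pred.size(-1)), target.reshape(-1))
    return loss / len(logits)

model = MultiTokenLM()
batch = torch.randint(0, 32_000, (2, 128))
print(multi_token_loss(model, batch).item())
```

At inference time, the extra heads can also be used to draft several tokens per forward pass, which is where the speed gains come from.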


Get the models

Meta Joint Audio and Symbolic Conditioning for Temporally Controlled Text-to-Music Generation


Generative AI has enabled people to explore their creativity in new ways, such as by turning a text prompt into a clip of music. While existing text-to-music models like MusicGen rely mainly on text inputs for music generation, our new model, Meta Joint Audio and Symbolic Conditioning for Temporally Controlled Text-to-Music Generation (JASCO), is capable of accepting various conditioning inputs, such as specific chords or beats, to improve control over generated music outputs. Specifically, we apply information bottleneck layers in conjunction with temporal blurring to extract relevant information with respect to specific controls. This allows the incorporation of both symbolic and audio-based conditions in the same text-to-music generation model.
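To give a sense of how such conditioning can work, here is a minimal sketch (assumed shapes and layer sizes; not the JASCO implementation) of temporal blurring followed by an information bottleneck applied to a frame-level control signal, such as chroma (chord) features:

```python
# Sketch of temporal blurring + an information bottleneck for a control signal.
# (Illustrative only; dimensions and the chroma features are assumptions.)
import torch
import torch.nn as nn

class BlurredBottleneck(nn.Module):
    def __init__(self, in_dim=12, bottleneck_dim=4, cond_dim=256, blur_window=25):
        super().__init__()
        # Temporal blurring: average over a window so only coarse, slowly
        # varying information (e.g. the active chord) survives.
        self.blur = nn.AvgPool1d(kernel_size=blur_window, stride=blur_window)
        # Information bottleneck: squeeze through a few dimensions, limiting
        # how much of the raw signal can leak into the condition.
        self.down = nn.Linear(in_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, cond_dim)

    def forward(self, feats):
        # feats: (batch, time, in_dim) frame-level control features
        x = self.blur(feats.transpose(1, 2)).transpose(1, 2)  # (batch, time//w, in_dim)
        return self.up(torch.relu(self.down(x)))              # (batch, time//w, cond_dim)

chroma = torch.rand(1, 500, 12)        # e.g. 500 frames of 12-bin chroma features
cond = BlurredBottleneck()(chroma)     # coarse conditioning stream
print(cond.shape)                      # torch.Size([1, 20, 256])
```

The resulting coarse, low-dimensional stream can then be combined with the text conditioning of the music generation model.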

Results suggest that JASCO is comparable to the evaluated baselines in terms of generation quality while allowing significantly better and more versatile control over the generated music. Today, we're releasing the research paper together with a sample page. Later this month, we'll release the inference code as part of the AudioCraft repository under an MIT license and the pre-trained model under CC-BY-NC.


AudioSeal


Generative AI tools are inspiring people to share their creations with their friends, family, and followers on social media. As with all AI innovations, it’s important that we do our part to help ensure responsible use of these tools. Today, we’re releasing AudioSeal, which we believe is the first audio watermarking technique designed specifically for the localized detection of AI-generated speech, making it possible to pinpoint AI-generated segments within a longer audio snippet. AudioSeal revamps classical audio watermarking by focusing on the detection of AI-generated content rather than steganography. Unlike traditional methods that rely on complex decoding algorithms, AudioSeal’s localized detection approach allows for faster and more efficient detection. This design enhances the detection speed by up to 485 times compared to previous methods, making it highly suitable for large-scale and real-time applications. Our approach achieves state-of-the-art performance in audio watermarking in terms of robustness and imperceptibility.
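Conceptually, localized detection means the detector emits a score for every audio frame rather than a single verdict for the whole clip. The sketch below illustrates that idea only; it is not the AudioSeal model or its API, and the architecture and frame rate are assumptions:

```python
# Conceptual sketch of localized watermark detection: a detector scores every
# audio frame, so watermarked (AI-generated) segments can be pinpointed.
# (Illustrative only; not the AudioSeal model or its API.)
import torch
import torch.nn as nn

class FrameDetector(nn.Module):
    def __init__(self, hidden=64):
        super().__init__()
        # Strided 1-D convolutions produce one score per coarse audio frame.
        self.net = nn.Sequential(
            nn.Conv1d(1, hidden, kernel_size=16, stride=8, padding=4), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=16, stride=8, padding=4), nn.ReLU(),
            nn.Conv1d(hidden, 1, kernel_size=8, stride=5),
        )

    def forward(self, wav):
        # wav: (batch, 1, samples) -> per-frame watermark probability
        return torch.sigmoid(self.net(wav)).squeeze(1)

detector = FrameDetector()
wav = torch.randn(1, 1, 16_000)            # one second of audio at 16 kHz
frame_probs = detector(wav)                # (1, n_frames)
flagged = (frame_probs > 0.5).nonzero()    # frames likely carrying a watermark
print(frame_probs.shape, flagged.shape)
```

Thresholding the per-frame scores directly yields the time regions that likely carry the watermark, avoiding a separate, slower decoding pass over the entire clip.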

AudioSeal is being released under a commercial license. It’s just one of several lines of responsible research we’ve shared to help prevent the misuse of generative AI tools. We include similar watermarks in speech samples generated by SeamlessM4T v2, our foundational translation model for text and speech, and Audiobox. We further detail our watermarking approach for images, speech, and text models in recent releases.


Get the model and training code

Partnership supporting the release of the PRISM dataset


Getting feedback from a diverse group of people is important to improving LLMs; however, open questions remain in the research community about the methods, domains, and objectives of the feedback process. We worked with our external partners to navigate these questions, supporting the release of the PRISM dataset, which maps the sociodemographics and stated preferences of 1,500 diverse participants from 75 countries. The dataset maps each person's preferences and fine-grained feedback to 8,011 live conversations with 21 different LLMs.

Meta advised our external partners on the compilation of the PRISM dataset, focusing the conversations on subjective and multicultural perspectives and on topics where interpersonal and cross-cultural disagreement is likely. Our paper demonstrates the usefulness of PRISM through three case studies on dialogue diversity, preference diversity, and welfare outcomes, showing that it matters which humans set alignment norms. While we hope this will serve as a community resource, we also want it to inspire broader participation in AI development and foster a more inclusive approach to technology design.
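As an illustration of how such a dataset might be analyzed (the file names and column names below are hypothetical, not the published PRISM schema), one could join participant sociodemographics with conversation-level preference ratings:

```python
# Hypothetical example of joining participant metadata with preference ratings.
# File and column names are illustrative, not the published PRISM schema.
import pandas as pd

participants = pd.read_csv("participants.csv")   # e.g. participant_id, country, age_group
ratings = pd.read_csv("conversations.csv")       # e.g. participant_id, model_name, preference_score

merged = ratings.merge(participants, on="participant_id", how="left")

# Average stated preference per model, broken down by participant country.
by_country = (
    merged.groupby(["model_name", "country"])["preference_score"]
    .mean()
    .reset_index()
)
print(by_country.head())
```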



Get the dataset from our external partners
Read the technical report

Measuring and improving geographical disparities in text-to-image generation systems


It’s important that text-to-image models work well for everyone and reflect the geographical and cultural diversity of the world. Improving these models requires new tools that enable researchers to gain a better understanding of where existing models may fall short. To address this goal, we’re detailing our recent research efforts and progress:


  • We developed automatic indicators called “DIG In” to evaluate potential geographical disparities in text-to-image models. In addition, to understand how people in different regions vary in their perceptions of geographic representation, we conducted a large-scale annotation study. We collected more than 65,000 annotations and more than 20 survey responses per example covering appeal, similarity, consistency, and shared recommendations for improved automatic and human evaluations of text-to-image models.

  • Through this work, we learned that people utilize specific components within an image when perceiving geographic representation, rather than viewing the entire image holistically. As part of our collaborative approach at Meta FAIR, we mentored a team of graduate students at UMass Amherst on a follow-up evaluation that decomposes the previously introduced automatic indicators into foregrounded concepts and background representations.

  • Informed by the DIG In measurement work, we also explored methods for improving the diversity of outputs from text-to-image models. In this direction, we introduced contextualized Vendi Score guidance, which extends our previous feedback guidance work: an inference-time intervention guides state-of-the-art text-to-image latent diffusion models to increase the representation diversity of generated samples while maintaining or improving image quality and prompt-generation consistency (a minimal sketch of the underlying Vendi Score follows this list).
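
For reference, the Vendi Score itself is the exponential of the Shannon entropy of the eigenvalues of a normalized similarity matrix over a set of samples. The sketch below computes it for a batch of embeddings (the embeddings and the cosine kernel are assumptions; the inference-time guidance procedure itself is not shown):

```python
# Minimal sketch of the Vendi Score, the diversity measure behind the guidance
# method described above (embeddings and kernel are assumptions; the
# inference-time guidance itself is not shown).
import numpy as np

def vendi_score(embeddings):
    """Exponential of the entropy of the eigenvalues of K/n, where K is a
    cosine-similarity matrix of the sample embeddings."""
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    k = x @ x.T                          # similarity matrix, K_ii = 1
    n = k.shape[0]
    eigvals = np.linalg.eigvalsh(k / n)  # eigenvalues sum to 1
    eigvals = eigvals[eigvals > 1e-12]
    return float(np.exp(-np.sum(eigvals * np.log(eigvals))))

# A tight cluster of samples scores near 1; spread-out samples score higher.
rng = np.random.default_rng(0)
similar = rng.normal(size=(1, 64)) + 0.01 * rng.normal(size=(32, 64))
diverse = rng.normal(size=(32, 64))
print(vendi_score(similar), vendi_score(diverse))
```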


Get the DIG In code