March 16, 2022
In order for AI to become a more useful tool, it has to learn how to accurately interpret content more holistically. This means working in multiple modalities (such as text, speech, and images) at once. For example, recognizing whether a meme is hateful requires considering both the image and the text content of the meme. Similarly, building the metaverse will require integrating multimodal models with augmented and virtual reality devices, so they can recognize the sound of a siren, for example, and display an alert showing which direction the sound is coming from.
Historically, analyzing such different formats of data together (text, images, speech waveforms, and video, each typically handled by its own distinct architecture) has been extremely challenging for machines.
Over the last couple of years, Meta AI has produced a slew of research projects, each addressing an important challenge of multimodal perception, from solving a shortage of publicly available training data (Hateful Memes), to creating a single algorithm for vision, speech, and text (data2vec), to building foundational models that work across many tasks (FLAVA), to finding the right model parameters (Omnivore), and many others. Taken together, they represent a clear trend: multimodal understanding will be crucial to smarter AI systems in the near future.
Today, we’re sharing a roundup of Meta AI’s recent cutting-edge multimodal research, which we believe will collectively lead to more interactive, immersive, and smarter AI systems.
Our new Omnivore model can operate on image, video, and 3D data using the same parameters — without degrading performance on modality-specific tasks. For example, it can recognize 3D models of pumpkins or videos of yachts even though at training time it observed images only of pumpkins and yachts, respectively. This enables radically new capabilities, such as AI systems that can search and detect content in both images and videos. Omnivore has achieved state-of-the-art results on popular recognition tasks from all three modalities, with particularly strong performance on video recognition. Read the paper here.
FLAVA represents a new class of “foundational model” that’s jointly trained to do over 35 tasks across domains, including image recognition, text recognition, and joint text-image tasks. For instance, the FLAVA model can single-handedly describe the content of an image, reason about its text entailment, and answer questions about the image. FLAVA also leads to impressive zero-shot text and image understanding abilities over a range of tasks, such as image classification, image retrieval, and text retrieval.
FLAVA not only improves over prior work, which is typically good at only one task, but also uses a shared trunk pretrained on publicly available image-text pairs, which we hope will help further advance research. Read the paper here.
CM3 is one of the most general open source multimodal models available today. By training on a large corpus of structured multimodal documents, it can generate completely new images and captions for those images. It can also infill complete images or larger structured text sections, conditioned on the rest of the document. Using prompts written in an HTML-like syntax, the exact same CM3 model can generate new images or text, caption images, and disambiguate entities in text.
Traditional approaches to pretraining have tied architectural choices (e.g., encoder-decoder) to objective choices (e.g., masking). Our novel “causally masked” objective gets the best of both worlds by introducing a hybrid of causal and masked language models. Read the paper here.
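As a rough illustration of that hybrid, the sketch below follows the general recipe described in the CM3 paper: a span is cut out of the document, a sentinel token is left in its place, and the span is moved to the end so that an ordinary left-to-right model generates it with context from both sides. The function names, the `<mask:i>` sentinel spelling, and the hand-picked span are illustrative assumptions, not the released implementation.

```python
from typing import List, Sequence, Tuple


def causally_mask(tokens: Sequence[str],
                  spans: Sequence[Tuple[int, int]]) -> List[str]:
    """Replace each [start, end) span with a sentinel and append the
    removed spans, each preceded by its sentinel, at the end."""
    body: List[str] = []   # document with sentinels in place of spans
    tail: List[str] = []   # sentinels followed by the removed tokens
    cursor = 0
    for i, (start, end) in enumerate(sorted(spans)):
        if start < cursor or end > len(tokens) or start >= end:
            raise ValueError("spans must be non-empty, in range, non-overlapping")
        sentinel = f"<mask:{i}>"  # hypothetical sentinel format
        body.extend(tokens[cursor:start])
        body.append(sentinel)
        tail.append(sentinel)
        tail.extend(tokens[start:end])
        cursor = end
    body.extend(tokens[cursor:])
    return body + tail


def next_token_pairs(sequence: Sequence[str]) -> List[Tuple[str, str]]:
    """Standard causal LM supervision: each token predicts the next one."""
    return list(zip(sequence[:-1], sequence[1:]))


tokens = "the cat sat on the mat".split()
rearranged = causally_mask(tokens, [(1, 3)])
print(rearranged)
```

Because the masked span now sits after the rest of the document, plain next-token training on the rearranged sequence lets the model see the right-hand context ("on the mat") before it has to produce "cat sat", which is the masked-model behavior, while everything else is still learned causally.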
Research in self-supervised learning today is almost always focused on one particular modality. In our recent breakthrough data2vec research, we show that the exact same model architecture and self-supervised training procedure can be used to develop state-of-the-art models for recognition of images, speech, and text. The illustration below shows how data2vec is used with images, but the same procedure can also be used to train models for speech or natural language. Data2vec demonstrates that the same self-supervised algorithm can work well in different modalities, and it often outperforms the best existing algorithms. Read more about data2vec here.
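To make the shared training procedure a little more concrete: as described in the data2vec paper, a student network sees a masked input and regresses the representations that a teacher network produces for the unmasked input, and the teacher's weights are an exponential moving average of the student's. The sketch below shows only those pieces on plain Python lists. The function names, the one-number-per-position features, and the squared-error loss are simplifying assumptions for illustration, not the released code.

```python
from typing import List, Sequence


def ema_update(teacher: Sequence[float], student: Sequence[float],
               tau: float) -> List[float]:
    """Move teacher weights toward the student: t <- tau*t + (1-tau)*s."""
    return [tau * t + (1.0 - tau) * s for t, s in zip(teacher, student)]


def average_top_layers(layer_outputs: Sequence[Sequence[float]],
                       k: int) -> List[float]:
    """Build per-position targets by averaging the teacher's top-k layers.
    layer_outputs[layer][position] is a single feature (toy assumption)."""
    top = layer_outputs[-k:]
    return [sum(values) / len(top) for values in zip(*top)]


def masked_regression_loss(pred: Sequence[float], target: Sequence[float],
                           masked_positions: Sequence[int]) -> float:
    """Mean squared error, computed only where the student's input was masked."""
    errors = [(pred[i] - target[i]) ** 2 for i in masked_positions]
    return sum(errors) / len(errors)


# Toy teacher with three layers and two positions.
teacher_layers = [[0.0, 0.0], [1.0, 2.0], [3.0, 4.0]]
targets = average_top_layers(teacher_layers, k=2)
loss = masked_regression_loss([2.0, 0.0], targets, masked_positions=[1])
new_teacher = ema_update([1.0, 0.0], [0.0, 1.0], tau=0.5)
print(targets, loss, new_teacher)
```

Nothing in these functions depends on whether the positions are image patches, audio frames, or text tokens, which is the sense in which one procedure can serve several modalities; only the modality-specific input encoding differs.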
Our data2vec models are currently trained separately for each of the various modalities. But our results from Omnivore, FLAVA, and CM3 suggest that, over the horizon, we may be able to train a single AI model that solves challenging tasks across all the modalities. Such a multimodal model would unlock many new opportunities. For example, it would further enhance our ability to comprehensively understand the content of social media posts in order to recognize hate speech or other harmful content. It could also help us build AR glasses that have a more comprehensive understanding of the world around them, unlocking exciting new applications in the metaverse.
As interest in multimodality has grown, we want researchers to have great tools for quickly building and experimenting with multimodal, multitask models at scale. We are open-sourcing TorchMultimodal, a library of multimodal primitives (models, fusion layers, loss functions, datasets, and utilities) and a repository of examples that bring together components and common infrastructure from across the PyTorch ecosystem. As a first open source example, researchers will be able to train and extend FLAVA using this new library. Keep an eye out for more details soon.
As part of our continued commitment to open science, we are excited to share our most recent research results and are looking forward to building the multimodal AI future together with the wider AI community.