
The first high-performance self-supervised algorithm that works for speech, vision, and text

January 20, 2022

Self-supervised learning — where machines learn by directly observing the environment rather than being explicitly taught through labeled images, text, audio, and other data sources — has powered many significant recent advances in AI. But while people appear to learn in a similar way regardless of how they get information — whether they use sight or sound, for example — there are currently big differences in the way self-supervised learning algorithms learn from images, speech, text, and other modalities.

This discrepancy has been a significant barrier to applying advances in self-supervised learning more broadly. Because a powerful algorithm designed for, say, understanding images can’t be directly applied to another modality, such as text, it is difficult to push several modalities ahead at the same rate.

This is why Meta AI developed and is excited to announce data2vec, the first high-performance self-supervised algorithm that works for multiple modalities. We applied data2vec separately to speech, images, and text, and it outperformed the previous best single-purpose algorithms for computer vision and speech while remaining competitive on NLP tasks. It also represents a new paradigm of holistic self-supervised learning, where new research improves multiple modalities rather than just one, and it does not rely on contrastive learning or on reconstructing the input example. In addition to helping accelerate progress in AI, data2vec brings us closer to building machines that learn seamlessly about different aspects of the world around them. It will enable us to develop more adaptable AI, which we believe will be able to perform tasks beyond what today’s systems can do.

As part of this announcement, we are sharing code and pretrained models for data2vec so that others in the research community can build upon our work.

How data2vec works

Much of AI is still based on supervised learning, which works exclusively with labeled data. But it’s simply not possible to collect labeled data for all the things we would like machines to do. For example, while researchers have done a lot of work in creating large-scale labeled data sets for English speech and text, it is not feasible to do this for the literally thousands of languages spoken on the planet.

Self-supervision enables computers to learn about the world just by observing it and then figuring out the structure of images, speech, or text. Having machines that don’t need to be explicitly taught to classify images or understand spoken language is simply much more scalable.

Research in self-supervised learning today is almost always focused on one particular modality, so researchers working on one modality often take a very different approach from those working on another. For text, researchers train models to fill in blanks in sentences. Speech models, however, need to learn an inventory of the basic sounds of speech in order to predict missing sounds. In computer vision, models are often trained to assign similar representations to a color image of a cow and the same image flipped upside down, so that they associate the two much more closely than they would either one with an unrelated image, such as that of a duck.

Algorithms also predict different units for each modality: pixels or visual tokens for images, words for text, and learned inventories of sounds for speech. A collection of pixels is very different from an audio waveform or a passage of text, and because of this, algorithm design has been tied to a specific modality. This means that algorithms are still functioning differently in each modality.

data2vec learns in the same way for images, speech, and text.

Data2vec simplifies this by training models to predict their own representations of the input data, regardless of the modality. By focusing on these representations — the layers of a neural network — instead of predicting visual tokens, words, or sounds, a single algorithm can work with completely different types of input. This removes the dependence on modality-specific targets in the learning task. Directly predicting representations is not straightforward, and it required defining a robust normalization of the features for the task that would be reliable in different modalities.
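To illustrate why feature normalization matters here, the sketch below standardizes target vectors to zero mean and unit variance. This is only one common choice, assumed for illustration; this post does not say which normalization data2vec uses. The function name normalize_features and the example vectors are hypothetical.

```python
import math

def normalize_features(vec, eps=1e-12):
    """Standardize one feature vector to zero mean and unit variance.

    Assumption: plain standardization, chosen for illustration only.
    """
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    scale = math.sqrt(var + eps)  # eps guards against division by zero
    return [(x - mean) / scale for x in vec]

# Two hypothetical target vectors with very different scales,
# standing in for representations from two different modalities.
image_like = [120.0, 80.0, 100.0, 100.0]
text_like = [0.12, 0.08, 0.10, 0.10]

norm_image = normalize_features(image_like)
norm_text = normalize_features(text_like)
```

After normalization the two vectors are nearly identical even though their raw magnitudes differ by a factor of 1,000, so a single prediction loss can treat targets from both sources on the same footing.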

Our method uses a teacher network to first compute target representations from an image, a piece of text, or a speech utterance. Next, we mask part of the input and repeat the process with a student network, which then predicts the latent representations of the teacher. The student model has to predict representations of the full input data even though it has a view of only some of the information. The teacher network is identical to the student model but with weights that are slightly out of date.
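The teacher-student procedure described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the released implementation. The post says only that the teacher's weights are "slightly out of date"; the sketch assumes an exponential moving average of the student weights. It also assumes a squared-error loss and zero as the mask value. The encode function is a hypothetical stand-in for a real network.

```python
def encode(weights, inputs):
    """Toy 'network' (hypothetical): one representation per position.

    Each output mixes the position's own input with the mean of all
    inputs, so masking any position changes every representation.
    """
    context = sum(inputs) / len(inputs)
    return [w * x + context for w, x in zip(weights, inputs)]

def masked_prediction_loss(student_w, teacher_w, inputs, masked_positions):
    """Student predicts teacher representations at the masked positions."""
    targets = encode(teacher_w, inputs)  # teacher sees the full input
    masked = [0.0 if i in masked_positions else x
              for i, x in enumerate(inputs)]  # assumption: mask value is 0
    preds = encode(student_w, masked)  # student sees the masked input
    errors = [(preds[i] - targets[i]) ** 2 for i in masked_positions]
    return sum(errors) / len(errors)  # assumption: squared-error loss

def ema_update(teacher_w, student_w, decay=0.999):
    """Assumed update rule: teacher trails the student as a moving average."""
    return [decay * t + (1.0 - decay) * s
            for t, s in zip(teacher_w, student_w)]

student = [0.5, -0.2, 0.8, 0.1]
teacher = list(student)  # teacher starts as a copy of the student
inputs = [1.0, 2.0, 3.0, 4.0]

loss = masked_prediction_loss(student, teacher, inputs, {1, 3})

# Stand-in for one optimizer step on the student; a real system would
# follow the gradient of the loss instead of adding a constant.
student_after = [w + 0.1 for w in student]
teacher_after = ema_update(teacher, student_after, decay=0.9)
```

Because the teacher only drifts toward the student, its targets change slowly from step to step, which is one way to read "slightly out of date" in the description above.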

We tested the method on the popular ImageNet computer vision benchmark, where it performed better than existing methods for popular model sizes. On speech, we found that it performed better than wav2vec 2.0 and HuBERT, two previous Meta AI self-supervised algorithms for speech. For text, we tested it on the popular GLUE benchmark suite, and it performed as well as RoBERTa, a reimplementation of BERT.

Data2vec for computer vision: performance on the popular ImageNet benchmark for ViT-B models compared with other recent methods.

Data2vec for speech: performance for Base models on the LibriSpeech benchmark with 10h labeled data compared with other recent methods. Lower error rate indicates better performance.

Data2vec for text: performance on the GLUE natural language understanding benchmark for Base models compared with RoBERTa when retrained with the original BERT settings. Higher score indicates better performance.

Toward machines that learn from observing the world around them

While self-supervised learning has made great progress in computer vision, video, and other individual modalities through different learning objectives, the core idea of this approach is to learn more generally: AI should be able to learn to do many different tasks, including those that are entirely unfamiliar. We want a machine to not only recognize animals shown in its training data but also adapt to recognize new creatures if we tell it what they look like. Data2vec demonstrates that the same self-supervised algorithm can work well in different modalities, and often better than the best existing algorithms. This paves the way for more general self-supervised learning and brings us closer to a world where AI might use videos, articles, and audio recordings to learn about complicated subjects, such as the game of soccer or different ways to bake bread.

We also hope data2vec will bring us closer to a world where computers need very little labeled data in order to accomplish tasks. Since it is difficult and sometimes impossible to collect annotated examples (to train speech recognition models for thousands of languages, for example), data2vec is an important step toward more general AI. This project complements research on general model architectures, and we hope that in the future we can remove the need for modality-specific feature extractors by combining these two lines of work.

Access the open source code and released pretrained models here and read the paper here.

This blog post was made possible by the work of Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, and Michael Auli.