Data2vec 2.0: Highly efficient self-supervised learning for vision, speech and text

December 13, 2022

Many recent breakthroughs in AI have been powered by self-supervised learning, which enables machines to learn without relying on labeled data. But current algorithms have several significant limitations, often including being specialized for a single modality (such as images or text) and requiring lots of computational power. This contrasts with human learning: People appear to learn much more efficiently than current AI, and also learn from different kinds of information in a similar way, rather than relying on separate learning mechanisms for text, speech, and other modalities.

Meta AI addressed one of these limitations earlier this year when we released data2vec, the first high-performance self-supervised algorithm to learn the same way for three different modalities: speech, vision, and text. Data2vec made it much easier to apply research advances in, say, text understanding to an image segmentation or speech translation task.

Today, we’re sharing data2vec 2.0, a new algorithm that is vastly more efficient and outperforms its predecessor’s strong performance. It achieves the same accuracy as the most popular existing self-supervised algorithm for computer vision but does so 16x faster.

To make our research accessible to other researchers, we are now sharing the code and pretrained models.

How data2vec 2.0 works

The general idea of self-supervised learning is for machines to learn the structure of images, speech, and text simply by observing the world. Advances in this area have led to many breakthroughs in speech (e.g., wav2vec 2.0), computer vision (e.g., masked autoencoders), and natural language processing (e.g., BERT). But modern systems can be computationally demanding, as training very large models requires many GPUs.

Illustration of how data2vec 2.0 training works. It can be trained separately on text, speech, or images.

Similar to the original data2vec algorithm, data2vec 2.0 predicts contextualized representations of the data — or the layers of a neural network — instead of the pixels of an image, the words of a text passage, or the sounds of speech. Unlike with most other algorithms, these so-called target representations are contextualized, meaning they take the entire training example into account. For instance, the representation of the word bank is based on the entire sentence that the word appears in, and it is therefore easier to represent the correct meaning of the word (“financial institution” or “ground next to river”). We believe that contextualized targets lead to a richer learning task and enable data2vec 2.0 to learn faster than other algorithms.

We improved the efficiency of the original data2vec algorithm in several ways: First, we take target representations built for a particular training example and reuse them for masked versions (where we hide different random parts of the training example). We take each version and feed it into the student model, which predicts the same contextualized target representation for the different masked versions. This effectively amortizes the computational effort required to create target representations. Second, and similar to masked autoencoders, we do not run the student encoder network for the parts of the training examples that are blanked out (which is about 80 percent of an image in our case), thereby saving significant compute cycles. Finally, we use a more efficient decoder model that relies not on Transformer networks but on a multilayer convolutional network.

Relative training time improvements when training data2vec 2.0 to the same accuracy as popular existing algorithms on the same hardware.

Efficiency gains with data2vec 2.0

To get a better understanding of how much more efficient data2vec 2.0 is than its predecessor and other algorithms, we tested it on computer vision, speech, and text tasks on widely used benchmarks. We were looking at the final accuracy and the time it took to pretrain the model. We measured the speed of algorithms on the same hardware (number of GPUs, etc.).

For computer vision, we evaluated data2vec 2.0 on the standard ImageNet-1K image classification benchmark, where it learned to represent images. Data2vec 2.0 can equal the accuracy of masked autoencoders (MAE) but is 16x faster (measured in wall clock time in a like-for-like setting). If we give the algorithm more time, it can achieve even higher accuracy while still being faster than MAE.

Data2vec 2.0 for computer vision: The graph shows speed vs. image classification accuracy for different algorithms on the popular ImageNet-1K benchmark.

For speech, we tested it on the LibriSpeech speech recognition benchmark, where it performed more than 11 times faster than wav2vec 2.0 with similar accuracy. For natural language processing (NLP), we evaluated data2vec 2.0 on the popular General Language Understanding Evaluation (GLUE) benchmark, where it achieved the same accuracy as RoBERTa, a reimplementation of BERT, in half the training time.

Data2vec 2.0 for speech and NLP: The top graph shows speed vs. speech recognition word error rate for models pretrained on LibriSpeech, fine-tuned on 10 hours of Libri-light data, and then evaluated on dev-other. The bottom graph shows natural language understanding accuracy on the GLUE benchmark when using the original BERT setup.

Toward machines that learn efficiently

We are on a journey to build more general and efficient self-supervised algorithms that can use a single learning objective to learn from different modalities. The ability to learn more efficiently is particularly important for modalities such as video, which require a lot of computational effort to process. We hope that more efficient self-supervised learning algorithms such as data2vec 2.0 will lead to machines that can deeply understand extremely complex data, such as the contents of an entire movie.

Access the open source code and pretrained models here, and read the paper hear.

Get it on GitHub:

Read the paper:

Efficient self-supervised learning with contextualized target representations for speech, vision, and language

This blog post was made possible by the work of Alexei Baevski, Arun Babu, Wei-Ning Hsu, and Michael Auli.

The second graphic in this post has been updated to correct a typographical error.