April 30, 2021
Many of the most exciting new AI breakthroughs have come from two recent innovations: self-supervised learning, which allows machines to learn from random, unlabeled examples; and Transformers, which enable AI models to selectively focus on certain parts of their input and thus reason more effectively. Both methods have been a sustained focus for Facebook AI, and we’re pleased to share new work that uses them to advance the state of the art in computer vision.
Working in collaboration with researchers at Inria, we have developed a new method, called DINO, to train Vision Transformers (ViT) with no supervision. Besides setting a new state of the art among self-supervised methods, this approach leads to a remarkable result that is unique to this combination of AI techniques. Our model can discover and segment objects in an image or a video with absolutely no supervision and without being given a segmentation-targeted objective. As the following example shows, our features are easily interpretable, suggesting that this class of models is capable of a higher level of image understanding.
Segmenting objects helps facilitate tasks ranging from swapping out the background of a video chat to teaching robots that navigate through a cluttered environment. It is considered one of the hardest challenges in computer vision because it requires that AI truly understand what is in an image. This is traditionally done with supervised learning and requires large volumes of annotated examples. But our work with DINO shows highly accurate segmentation may actually be solvable with nothing more than self-supervised learning and a suitable architecture. By using self-supervised learning with Transformers, DINO opens the door to building machines that understand images and video much more deeply.
Strong performance matters in computer vision, as it does in other tasks. But efficiency is also vital because it enables researchers to train models even if they don’t have access to large-scale computing resources. We are also sharing details on PAWS, a new model-training approach that can deliver state-of-the-art results using much less compute. When pretraining a standard ResNet-50 model with PAWS using just 1 percent of the labels in ImageNet, we get state-of-the-art accuracy while doing 10x fewer pretraining steps.
With DINO and PAWS, the AI research community can build new computer vision systems that are far less dependent on labeled data and vast computing resources for training. We hope that our experiments will show the community the potential of self-supervised systems trained on ViT and encourage further adoption. Our code is publicly available here and here.
These projects are the result of our long-term collaborations with academic institutions across the world, including Inria and Sorbonne University in France, and MILA and McGill University in Canada.
Transformers have produced state-of-the-art results in many areas of artificial intelligence, including NLP and speech. In the past year, seminal works have also successfully adapted Transformers to computer vision problems such as image classification and detection. Large amounts of unlabeled data offer a great opportunity to pretrain these rich Transformer-based image representations.
Training ViT with our DINO algorithm, we observe that our model automatically learns an interpretable representation and separates the main object from the background clutter. It learns to segment objects without any human-generated annotation or any form of dedicated dense pixel-level loss.
The core components of Vision Transformers are self-attention layers. In this model, each spatial location builds its representation by “attending” to the other locations. That way, by “looking” at other, potentially distant pieces of the image, the network builds a rich, high-level understanding of the scene.
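To make the idea of each location “attending” to the others concrete, here is a minimal sketch of single-head scaled dot-product self-attention in NumPy. It is an illustration only, not the ViT implementation: the patch count, dimensions, and random projection matrices (`w_q`, `w_k`, `w_v`) are assumptions chosen for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head self-attention over a set of image-patch tokens.

    tokens: (n_patches, dim); w_q, w_k, w_v: (dim, head_dim).
    Returns the updated tokens and the (n_patches, n_patches) attention map.
    """
    q = tokens @ w_q
    k = tokens @ w_k
    v = tokens @ w_v
    # Row i holds how strongly location i attends to every other location.
    scores = q @ k.T / np.sqrt(k.shape[-1])
    attn = softmax(scores, axis=-1)
    return attn @ v, attn

# Hypothetical sizes: a 4x4 grid of patches with 32-dimensional embeddings.
rng = np.random.default_rng(0)
n_patches, dim, head_dim = 16, 32, 8
tokens = rng.normal(size=(n_patches, dim))
w_q, w_k, w_v = (0.1 * rng.normal(size=(dim, head_dim)) for _ in range(3))

out, attn = self_attention(tokens, w_q, w_k, w_v)
```

Each row of `attn` is a distribution over all patches, which is the kind of quantity one would reshape to the patch grid in order to display an attention map.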
When visualizing the local attention maps in the network, we see that they correspond to coherent semantic regions in the image.
DINO works by interpreting self-supervision as a special case of self-distillation, where no labels are used at all. Indeed, we train a student network by simply matching the output of a teacher network over different views of the same image.
We identified two components from previous self-supervised approaches that are particularly important for strong performance on ViT, the momentum teacher and multicrop training, and integrated them into our framework. The resulting model outperforms all previously proposed self-supervised systems, revealing the potential of ViTs for self-supervised learning.
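The two ingredients described above, a student matching a teacher's output on another view and a momentum teacher, can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the released DINO code: the "network" is a single linear layer, the views are noisy copies of a feature vector, and the temperatures and momentum value are hypothetical choices. Multicrop training and any safeguards the full method uses are left out.

```python
import numpy as np

def log_softmax(x, temperature=1.0):
    z = x / temperature
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def distillation_loss(student_logits, teacher_logits,
                      student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between teacher targets and student predictions.

    The temperatures are illustrative assumptions. In training, the
    teacher output is treated as a constant (no gradient flows through it).
    """
    targets = np.exp(log_softmax(teacher_logits, teacher_temp))
    log_probs = log_softmax(student_logits, student_temp)
    return float(-(targets * log_probs).sum(axis=-1).mean())

def momentum_update(teacher_params, student_params, momentum=0.996):
    """Teacher weights follow an exponential moving average of the student."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]

rng = np.random.default_rng(0)
student = [rng.normal(size=(8, 4))]       # toy one-layer "network"
teacher = [p.copy() for p in student]     # teacher starts as a copy

image = rng.normal(size=(5, 8))           # stand-in for a batch of images
view_a = image + 0.1 * rng.normal(size=image.shape)  # two augmented views
view_b = image + 0.1 * rng.normal(size=image.shape)

# The student sees one view, the teacher the other.
loss = distillation_loss(view_a @ student[0], view_b @ teacher[0])

teacher_before = teacher[0].copy()
student[0] = student[0] + 0.01            # stand-in for a gradient step
teacher = momentum_update(teacher, student)
```

Because the teacher is only ever updated as a slow average of the student, it changes far less per step than the student does, which is what makes it a stable target.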
DINO learns a great deal about the visual world. By discovering object parts and shared characteristics across images, the model learns a feature space that exhibits a very interesting structure. If we embed ImageNet classes using the features computed using DINO, we see that they organize in an interpretable way, with similar categories landing near one another. This suggests that the model managed to connect categories based on visual properties, a bit like humans do. For example, we see that animal species are clearly separated, with a coherent structure that resembles the biological taxonomy.
This well-behaved feature space allows us to do very fast k-NN classification, without the heavy burden of fine-tuning the network or training classifiers. We compared our model with state-of-the-art self-supervised learning (SSL) techniques on the task of ImageNet classification. When comparing our models across throughput regimes, we see that DINO leads to the best performance.
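As a rough illustration of why k-NN classification on frozen features is cheap, the sketch below classifies queries by a majority vote among their nearest neighbors under cosine similarity. The two-dimensional synthetic features, the value of `k`, and the unweighted vote are assumptions for the example; the exact k-NN protocol used in our evaluation is not described here.

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def knn_predict(train_feats, train_labels, query_feats, k=5, num_classes=None):
    """Majority vote among the k most cosine-similar training features."""
    if num_classes is None:
        num_classes = int(train_labels.max()) + 1
    sims = l2_normalize(query_feats) @ l2_normalize(train_feats).T  # (q, n)
    nearest = np.argsort(-sims, axis=1)[:, :k]                      # (q, k)
    votes = train_labels[nearest]
    counts = np.stack([np.bincount(v, minlength=num_classes) for v in votes])
    return counts.argmax(axis=1)

# Synthetic stand-in for precomputed features: two well-separated classes.
rng = np.random.default_rng(0)
centers = np.array([[5.0, 0.0], [0.0, 5.0]])
train_labels = np.repeat([0, 1], 20)
train_feats = centers[train_labels] + 0.5 * rng.normal(size=(40, 2))

queries = np.array([[4.0, 0.5], [0.5, 4.0]])
pred = knn_predict(train_feats, train_labels, queries, k=5)
```

No parameters are trained here: once features are extracted, classification reduces to a matrix product and a sort.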
Another surprising finding of this work is that this method is also among the best at identifying image copies, even though it was not designed for this. DINO-based models could potentially become the standard for copy detection systems used to identify misinformation and copyright infringement.
As noted above, we’ve also developed a new approach, called PAWS, that achieves better classification accuracy than previous state-of-the-art self-supervised and semi-supervised approaches while training for 4x to 12x fewer epochs. For example, when training a ResNet-50 model on the ImageNet data set with only about 13 labeled examples per class, PAWS significantly surpasses the previous state of the art after only 100 epochs of training (12x fewer than the previous best method), and sets a new state of the art of 66 percent top-1 accuracy after only 200 epochs, a 6 percent improvement over the previous best method.
PAWS builds on self-supervised learning approaches like SwAV, but in contrast to self-supervised methods, PAWS achieves these results by leveraging a small amount of labeled data in conjunction with unlabeled data. Similar to self-supervised approaches, the focus during pretraining is to train a neural network to map images to latent representations. Given an unlabeled training image, we generate two or more views of the image using random data augmentations and transformations, and we train the neural network to make the representations of these views similar to one another.
Unlike self-supervised methods that directly compare the representations, PAWS uses a random subsample of labeled images to assign a (soft) pseudo-label to the unlabeled views. The pseudo-labels are obtained by comparing the representations of the unlabeled views with representations of labeled support samples, and then the model is updated by minimizing a standard classification loss, like cross-entropy, between the pseudo-labels of pairs of views of the same unlabeled image. Since it does not directly optimize prediction accuracy on the labeled samples, PAWS is much less prone to overfitting than other semi-supervised approaches. At the same time, by leveraging a small amount of labeled data in this way, PAWS trains significantly faster than typical self-supervised methods. Furthermore, in PAWS we form predictions for the positive and anchor views in slightly different ways, sharpening the target prediction. As a consequence, PAWS is guaranteed not to learn “collapsing representations,” where all images get mapped to the same representation, a common issue for self-supervised methods.
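The pseudo-labeling step described above can be sketched as follows. This is a minimal illustration, not the PAWS implementation: random vectors stand in for network representations, and the similarity temperature, the power-based `sharpen` function, and all sizes are assumptions made for the example.

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def soft_pseudo_labels(view_feats, support_feats, support_onehot, temperature=0.1):
    """Similarity-weighted average of the labels of the support samples."""
    sims = l2_normalize(view_feats) @ l2_normalize(support_feats).T
    weights = softmax(sims / temperature)      # (batch, n_support)
    return weights @ support_onehot            # (batch, n_classes)

def sharpen(probs, temperature=0.25):
    """Make a distribution more peaked (one assumed way to sharpen targets)."""
    p = probs ** (1.0 / temperature)
    return p / p.sum(axis=-1, keepdims=True)

def paws_style_loss(anchor_feats, positive_feats, support_feats, support_onehot):
    """Cross-entropy between the anchor prediction and the sharpened target."""
    preds = soft_pseudo_labels(anchor_feats, support_feats, support_onehot)
    targets = sharpen(soft_pseudo_labels(positive_feats, support_feats, support_onehot))
    log_preds = np.log(np.clip(preds, 1e-12, 1.0))
    return float(-(targets * log_preds).sum(axis=-1).mean())

rng = np.random.default_rng(0)
num_classes, per_class, dim = 3, 4, 16
support_labels = np.repeat(np.arange(num_classes), per_class)
support_onehot = np.eye(num_classes)[support_labels]
support_feats = rng.normal(size=(num_classes * per_class, dim))

# Two views of the same unlabeled images, faked here as noisy copies.
image_feats = rng.normal(size=(8, dim))
anchor = image_feats + 0.05 * rng.normal(size=image_feats.shape)
positive = image_feats + 0.05 * rng.normal(size=image_feats.shape)

labels = soft_pseudo_labels(anchor, support_feats, support_onehot)
loss = paws_style_loss(anchor, positive, support_feats, support_onehot)
```

Note that the labeled support samples only enter through the pseudo-labels; the loss never scores predictions on the support samples themselves, which matches the point above about not directly optimizing accuracy on the labeled data.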
The need for human annotation is usually a bottleneck in the development of computer vision systems. By making our approaches more annotation-efficient, we allow models to be applied to a larger set of tasks and potentially scale the number of concepts they can recognize. Learning with limited supervision is also important for domains where there are few annotated images, like in medical imaging.
We hope that reducing the computation requirements of self-supervised and semi-supervised approaches will increase their adoption and stimulate further research in this area. In the spirit of collaboration and open science, we are publishing our work, and the associated source code will be released.