SEER: The start of a more powerful, flexible, and accessible era for computer vision

March 4, 2021

The future of AI is in creating systems that can learn directly from whatever information they’re given, whether it’s text, images, or another type of data, without relying on carefully curated and labeled data sets to teach them how to recognize objects in a photo, interpret a block of text, or perform any of the countless other tasks that we ask of them.

This approach is known as self-supervised learning, and, as Facebook AI’s Chief Scientist Yann LeCun writes, it’s one of the most promising ways to build machines that have the background knowledge, or “common sense,” to tackle tasks that are far beyond today’s AI. We’ve already seen major advances in natural language processing (NLP) as a result, where self-supervised pretraining of very large models on enormous amounts of text has led to breakthroughs in question answering, machine translation, natural language inference, and more.

Facebook AI has now brought this self-supervised learning paradigm shift to computer vision. We’ve developed SEER (SElf-supERvised), a new billion-parameter self-supervised computer vision model that can learn from any random group of images on the internet — without the need for careful curation and labeling that goes into most computer vision training today.

After pretraining on a billion random, unlabeled, and uncurated public Instagram images, SEER outperformed the most advanced self-supervised systems, reaching 84.2 percent top-1 accuracy on ImageNet. SEER also outperformed state-of-the-art supervised models on downstream tasks, including low-shot learning, object detection, segmentation, and image classification. When trained with just 10 percent of the examples in the ImageNet data set, SEER still achieved 77.9 percent top-1 accuracy on the full data set. When trained with just 1 percent of the annotated ImageNet examples, SEER achieved 60.5 percent top-1 accuracy.

SEER’s performance demonstrates that self-supervised learning can excel at computer vision tasks in real-world settings. This is a major breakthrough that ultimately clears the path for more flexible, accurate, and adaptable computer vision models in the future.

We are sharing details on SEER with the AI community — and open-sourcing VISSL, the library we used to develop SEER — to further democratize self-supervised learning and accelerate progress toward a completely self-supervised future. Making progress on a challenge this broad and deep requires the open exchange of ideas among diverse minds in the field. We remain committed to the principles of open science, and hope that this brings the field significantly closer to building machines that understand the visual world as well as people do.

Self-supervised computer vision in the real world

Our work with SEER parallels work done in NLP, where state-of-the-art models now regularly use trillions of parameters and data sets with trillions of words of text for pretraining. With more input and larger models, performance on downstream tasks improves dramatically — and the same should be true in computer vision.

But using self-supervision for vision problems is different from using it for language. With text, semantic concepts are broken up into discrete words. With images, the algorithm must decide which pixel belongs to which concept. Furthermore, the same concept varies greatly between images, such as a cat in different poses or viewed from different angles. We need to look at a lot of images to grasp the variation around a single concept.

Successfully scaling models to work efficiently with complex high-dimensional image data required two key components: 1) an algorithm that could learn from a vast number of random images without any metadata or annotations, and 2) a convolutional network (ConvNet) large enough to capture and learn every visual concept from this large and complex data.

Fortunately, recent progress by Facebook AI and others in the fields of self-supervised learning and ConvNet architecture design has finally made it possible to apply these ideas to computer vision — though we still needed to overcome several challenges, not least of which was the compute capabilities required.

SEER combines a recent architecture family, RegNet, with online self-supervised training to scale pretraining to billion-parameter models on billions of random images.

We took advantage of a new algorithm called SwAV, which developed from a collaboration between FAIR and Inria to research self-supervised learning. SwAV uses online clustering to rapidly group images with similar visual concepts and leverage their similarities. With SwAV, we were able to improve over the previous state of the art in self-supervised learning — and did so with 6x less training time.
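
To make the idea of online clustering concrete, here is a minimal pure-Python sketch of a swapped-prediction loss in the spirit of SwAV. It is an illustration under stated assumptions, not the SEER or VISSL implementation: the function names, the `eps` and `temperature` values, and the toy similarity scores are all hypothetical. Each image is seen in two augmented views, each view is scored against a small set of cluster prototypes, and the cluster assignment computed from one view is predicted from the other.

```python
import math

def sinkhorn(scores, eps=0.05, n_iters=3):
    """Turn a batch of prototype scores (B rows, K columns) into soft cluster
    codes while spreading the batch roughly evenly across the K prototypes."""
    B, K = len(scores), len(scores[0])
    m = max(max(row) for row in scores)  # subtract the max for numerical stability
    q = [[math.exp((s - m) / eps) for s in row] for row in scores]
    for _ in range(n_iters):
        # Column step: every prototype receives the same total mass (1/K).
        col = [sum(q[b][k] for b in range(B)) for k in range(K)]
        q = [[q[b][k] / (col[k] * K) for k in range(K)] for b in range(B)]
        # Row step: every image carries the same total mass (1/B).
        row = [sum(r) for r in q]
        q = [[q[b][k] / (row[b] * B) for k in range(K)] for b in range(B)]
    # Rescale so that each row is a probability distribution over prototypes.
    return [[v * B for v in r] for r in q]

def log_softmax(row, temperature=0.1):
    scaled = [s / temperature for s in row]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(v - m) for v in scaled))
    return [v - log_z for v in scaled]

def swapped_prediction_loss(scores_a, scores_b):
    """Cross-entropy between the codes of one view and the predictions of the other."""
    codes_a, codes_b = sinkhorn(scores_a), sinkhorn(scores_b)
    total = 0.0
    for qa, qb, sa, sb in zip(codes_a, codes_b, scores_a, scores_b):
        lp_a, lp_b = log_softmax(sa), log_softmax(sb)
        total -= sum(q * lp for q, lp in zip(qa, lp_b))  # predict A's code from B
        total -= sum(q * lp for q, lp in zip(qb, lp_a))  # predict B's code from A
    return total / (2 * len(scores_a))

# Toy scores (hypothetical): 4 images, 2 prototypes, two views per image.
view_a = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
view_b = [[0.8, 0.2], [0.9, 0.1], [0.2, 0.8], [0.1, 0.9]]
loss = swapped_prediction_loss(view_a, view_b)
```

Because the codes are computed per batch rather than over the whole data set, a loss of this shape can be evaluated as images stream in, which is what makes the clustering "online". In the toy data the two views of each image agree on a prototype, so the loss is small; pairing each image with a mismatched view makes it much larger.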

Training models at this scale also required a model architecture that was efficient in terms of both runtime and memory, without compromising on accuracy. Fortunately, a recent innovation by FAIR in the realm of architecture design led to a new model family called RegNets that perfectly fit these needs. RegNet models are ConvNets capable of scaling to billions or potentially even trillions of parameters, and can be optimized to fit different runtime and memory limitations.
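
As a rough illustration of how a RegNet-style family can be scaled up or down with a handful of numbers, the sketch below generates per-stage widths and depths from a linear width rule that is then quantized, following the parameterization described in the RegNet design-space work. The function name and the parameter values are hypothetical and chosen for readability; they do not describe the SEER model.

```python
import math
from itertools import groupby

def regnet_stages(w_0, w_a, w_m, depth, multiple=8):
    """Return [(width, num_blocks), ...] for a RegNet-style network.

    w_0: initial width, w_a: width slope, w_m: width multiplier between stages,
    depth: total number of blocks. All values passed below are illustrative.
    """
    widths = []
    for j in range(depth):
        u = w_0 + w_a * j                              # linear target width for block j
        s = round(math.log(u / w_0) / math.log(w_m))   # nearest integer power of w_m
        w = w_0 * w_m ** s                             # quantized width
        widths.append(int(round(w / multiple)) * multiple)  # snap to a multiple of 8
    # Consecutive blocks that share a width form one stage.
    return [(w, len(list(group))) for w, group in groupby(widths)]

stages = regnet_stages(w_0=24, w_a=36.0, w_m=2.5, depth=13)
for width, num_blocks in stages:
    print(f"stage width={width:4d} blocks={num_blocks}")
```

Changing `depth`, `w_0`, `w_a`, or `w_m` produces wider, deeper, or lighter networks from the same rule, which is one way to read the claim that the family can be tuned to different runtime and memory budgets.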

This video compares SEER pretraining on random Instagram images with supervised pretraining on ImageNet. Our unsupervised features improve over supervised features by an average of 2 percent.

The last component that made SEER possible was the development of an all-purpose library for self-supervised learning called VISSL, which we are releasing today.

Open-sourcing the library for self-supervision

We are open-sourcing VISSL, the general-purpose library that we also used for SEER, so that the broader community can experiment with self-supervised learning from images. VISSL is a PyTorch-based library that allows for self-supervised training at both small and massive scale with a wide variety of modern methods. VISSL also contains an extensive benchmark suite and a model zoo consisting of more than 60 pretrained models, allowing researchers to compare several modern self-supervised methods.

VISSL facilitates self-supervised learning at scale by integrating several existing algorithms that reduce the per-GPU memory requirement and increase the training speed of any given model. VISSL combines:

  • Mixed precision from the NVIDIA Apex library: Reduces memory requirements and speeds up runtime

  • Gradient checkpointing from PyTorch: Allows the model to be trained on large batch sizes by trading compute for memory

  • Sharded optimizer from the FairScale library: Significantly reduces memory usage by sharding model optimizer state and gradients — a concept popularized by Microsoft ZeRO

  • Dedicated optimizations for online self-supervised training: For example, a constant learning rate schedule that does not depend on the total number of training parameter updates
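
A back-of-the-envelope calculation shows why these techniques matter at the billion-parameter scale. The sketch below is illustrative arithmetic only, under assumptions that are not taken from the SEER setup: 32-bit values, one optimizer state tensor per parameter (as with SGD with momentum), a hypothetical 8-GPU job, and equal-sized activations for every layer. Mixed precision is left out for simplicity, and the function names are hypothetical.

```python
import math

GIB = 2 ** 30

def model_state_gib(n_params, n_gpus, sharded, bytes_per_value=4, state_tensors=1):
    """Approximate per-GPU memory for weights, gradients, and optimizer state."""
    weights = n_params * bytes_per_value               # full copy on every GPU
    grads = n_params * bytes_per_value
    state = n_params * bytes_per_value * state_tensors
    if sharded:
        # Each GPU keeps only its 1/n_gpus slice of gradients and optimizer state.
        grads /= n_gpus
        state /= n_gpus
    return (weights + grads + state) / GIB

def stored_activations(n_layers, segment=None):
    """Number of layer activations held at once during the backward pass."""
    if segment is None:
        return n_layers                                # keep every activation
    # Keep one checkpoint per segment, plus the one segment being recomputed.
    return math.ceil(n_layers / segment) + segment

plain = model_state_gib(1_000_000_000, n_gpus=8, sharded=False)
sharded = model_state_gib(1_000_000_000, n_gpus=8, sharded=True)
print(f"model state per GPU: {plain:.1f} GiB -> {sharded:.1f} GiB with sharding")
print("activations kept:", stored_activations(100), "->", stored_activations(100, segment=10))
```

Under these assumptions, sharding cuts the per-GPU model state for a billion parameters from roughly 11 GiB to under 5 GiB, and checkpointing every 10 layers of a 100-layer network keeps 20 activations instead of 100, at the cost of recomputing each segment during the backward pass.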

SEER’s self-supervised model is built on the same core tools included in VISSL, in combination with a custom data loader for PyTorch that has higher data throughput than the default.

A self-supervised future

Self-supervised learning has long been a focus for Facebook AI because it enables machines to learn directly from the vast amount of information available in the world, rather than just from training data created specifically for AI research. This will help us build AI that works well for more people around the world, adapts quickly to changing circumstances, extends to additional use cases, and much more. We’ve published work on using self-supervision for tasks ranging from automated speech recognition to robotics to translating between programming languages to building production tools that help detect harmful content on our platforms.

Self-supervised learning has incredible ramifications for the future of computer vision, just as it does in other research fields. Eliminating the need for human annotations and metadata enables the computer vision community to work with larger and more diverse data sets, learn from random public images, and potentially mitigate some of the biases that come into play with data curation. Self-supervised learning can also help specialize models in domains where we have limited images or metadata, like medical imaging. And with no labor required up front for labeling, models can be created and deployed more quickly, enabling faster and more accurate responses to rapidly evolving situations. Self-supervised learning is a key component of creating an AI that understands the visual world, and our work on SEER gets us one step closer to that goal.

There’s still more work to be done, though, and in the spirit of collaboration and open science, we are publishing our work on SEER and releasing the accompanying library to help the broader research community push the limits of self-supervised learning in computer vision. Get started by visiting the VISSL website or by checking out the associated documentation and GitHub code.

Read the full paper:

Self-supervised pretraining of visual features in the wild