SEER 10B: Better, fairer computer vision through self-supervised learning on diverse datasets

February 28, 2022

We’re excited to announce new advances in SEER (SElf-SupERvised), Meta AI Research’s groundbreaking self-supervised computer vision model that can learn directly from any random collection of images on the internet — without the need for careful data curation and labeling that goes into conventional computer vision training — and then output an image embedding. SEER is now not only much more powerful, it also produces fairer, more robust computer vision models and can discover salient information in images, similar to how humans learn about the world by considering the relationships between the different objects they observe. SEER can help build breakthrough computer vision systems and advance towards building AI that works well for everyone. We’re also publicly releasing the model and sharing new technical details about how it works. While SEER is purely a research model for now, it will help Meta AI build better computer vision systems for products used by billions of people around the world.

When we first announced SEER last spring, it outperformed state-of-the-art systems, demonstrating that self-supervised learning can excel at computer vision tasks in real world settings. We’ve now scaled SEER from 1 billion to 10 billion dense parameters, making it to our knowledge the largest dense computer vision model of its kind.

Because of its increased size, SEER can extract better quality visual features and salient information present in real world large-scale data sets with trillions of random, uncurated images world-wide. This helps SEER perform better on tasks where smaller unsupervised models have struggled. With 10B parameters, SEER outperforms other models on important fairness benchmarks recently proposed by Meta AI Research. Traditional computer vision systems are trained primarily on examples from the U.S. and wealthy countries in Europe, so they often don’t work well for images from other places with different socioeconomic characteristics. But SEER delivers strong results for images from all around the globe – including non-U.S. and non-Europe regions with a wide range of income levels. Further, the 10B SEER model drastically improved performance on fairness benchmarks across different gender, apparent skin tone, and age groups. Apart from its improved performance on fairness benchmarks, this model understands images from across the world well enough to geolocalize them with unprecedented precision.

On top of achieving strong performance on standard computer vision benchmarks (for example, 85.8 percent top-1 accuracy on ImageNet), the model also excels at challenging tasks and generalizes more robustly to out-of-domain data. For example, it can correctly identify animals in sketches and artistic renditions, and it also handles challenges in images such as camouflage, blur, occlusion, motion, and unusual perspectives.

We remain committed to Meta AI Research’s principles of open science, so to facilitate further research and progress we are making SEER accessible by publicly releasing the model weights and implementation details and by sharing additional technical documentation explaining how it works and how it was trained.

We’ve long focused on self-supervised learning, because it allows us to move past the constraints of labeled data sets, and on building AI models that work at billion-scale. We’ve also prioritized an open science approach, where we publish our research and share our code and models so that collectively, the AI research community can learn from each other’s work, engage in peer review, and accelerate progress. SEER itself is an example of how collaboration can fuel progress in this field. The SEER model is based on SwAV, an algorithm developed jointly by Meta AI Research and Inria. Finally, we are committed to developing AI responsibly and building a robust system to address issues of privacy, fairness, accountability, and transparency. As part of that effort, we also conducted extensive state-of-the-art adversarial attacks on the SEER model to confirm the privacy of training data is protected.

Better performance and fairer predictions

We studied and validated SEER’s performance on more than 50 benchmarks – including fairness, robustness, fine-grained recognition, and a variety of image classification data sets from domains like medical imaging, satellite images, and optical character recognition (OCR). The 10 billion-parameter SEER model consistently outperformed its 1 billion-parameter predecessor, generating better visual features. Despite training on random collections of images on the internet with no data curation, the 10B model outperformed state-of-the-art supervised and self-supervised models trained on ImageNet on 70 percent of the benchmarks while achieving equal performance on the rest.

To test the model’s robustness to adversarial attacks, we evaluated it on the task of image copy detection, where the model must identify images that have been distorted through blurring, insertions, cropping, and other editing techniques. SEER outperformed the previous best results by achieving 90.6 percent mean average precision on the CopyDays benchmark, a 5.1 percent improvement. Further, SEER outperformed the state-of-the-art self-supervised models trained on ImageNet on out-of-domain robustness benchmarks and the model robustness consistently improved as we increased the size of the model.
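
To make the copy detection metric concrete, here is a minimal sketch of how mean average precision can be computed from image embeddings. It is not the CopyDays evaluation code: the cosine-similarity ranking, the function names, and the toy two-dimensional embeddings are all assumptions chosen for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def average_precision(query, database, relevant_ids):
    """Average precision for one query.

    database is a list of (image_id, embedding) pairs; relevant_ids is the
    set of ids that are true copies of the query (hypothetical ground truth).
    """
    ranked = sorted(database, key=lambda item: cosine(query, item[1]), reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (image_id, _) in enumerate(ranked, start=1):
        if image_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank  # precision at this recall point
    return precision_sum / len(relevant_ids) if relevant_ids else 0.0

def mean_average_precision(queries, database):
    """queries is a list of (embedding, relevant_ids) pairs."""
    scores = [average_precision(q, database, rel) for q, rel in queries]
    return sum(scores) / len(scores)

# Toy data: "a" and "b" stand in for edited copies of one image, "c" and "d" of another.
database = [("a", [1.0, 0.0]), ("b", [0.9, 0.1]), ("c", [0.0, 1.0]), ("d", [0.1, 0.9])]
queries = [([1.0, 0.05], {"a", "b"}), ([0.0, 1.0], {"d"})]
print(mean_average_precision(queries, database))
```

In this toy setup the first query ranks both of its copies on top, while the second finds its only copy at rank two, so the mean falls between a perfect and a half score. A real evaluation would use embeddings produced by the model rather than hand-written vectors.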

The large SEER model captures salient information present in a large set of random and unfiltered internet images, even across diverse geographies and linguistic concepts. Even though the model is trained only on the images themselves, with no location information or other metadata, it is able to group together the same concepts expressed in multiple languages all over the world. For example, images of the concept "wedding" from all over the world are embedded close together in the model's feature space.

To assess the model’s ability to work well for different groups (and motivated by prior research in the field of fairness and computer vision demonstrating widespread limitations in this area), we looked at whether the model could recognize social membership attributes, such as gender, equally well for people with different skin tones. In this analysis we used Meta AI Research’s recently open-sourced Casual Conversations data set along with our recently announced research proposing new fairness benchmarks for computer vision models. We found the SEER 10B model recognized these social membership attributes more accurately than the smaller SEER models as well as ImageNet-trained supervised and self-supervised models. The larger SEER model works well for people across different genders, skin tones, and ages.

This graph shows accuracy of gender retrieval, using the Casual Conversations data set.

Also using the Casual Conversations data set, we evaluated the model’s predictions for harmful inaccuracies, such as assigning labels like “non-human” or “crime” to an image of a particular person. Here, too, SEER 10B did not produce significant numbers of these associations, whereas supervised models trained on ImageNet did.

This graph shows the rate of meaningfully inaccurate predictions for different groups of people.

Last year, we tested our 1 billion-parameter SEER model on images of everyday items from around the world and found it outperformed conventional computer vision systems in recognizing objects that, while representative of life for billions of people, are less represented in conventional image data sets used to train AI systems. The 10 billion-parameter SEER model improves upon the performance of the smaller self-supervised model and significantly outperforms supervised methods. Using Gapminder’s Dollar Street data set, which collects images of objects in households around the world along with information on their household income, we found performance improved the most for low-medium income households across the world as well as for households in non-Western regions.

These graphs look at different income and geographic groups across the world and show how much the 10 billion-parameter SEER outperformed a supervised model trained on ImageNet.
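
A comparison like this comes down to computing accuracy separately for each group and then subtracting the baseline's accuracy from the model's. The sketch below illustrates that arithmetic only; the group names, record format, and function names are hypothetical and do not reflect the Dollar Street data format or the evaluation code used for SEER.

```python
from collections import defaultdict

def accuracy_by_group(records):
    """Map each group to its accuracy, given (group, is_correct) pairs."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for group, is_correct in records:
        totals[group] += 1
        hits[group] += int(is_correct)
    return {group: hits[group] / totals[group] for group in totals}

def accuracy_gain(model_records, baseline_records):
    """Per-group accuracy difference (model minus baseline) for shared groups."""
    model = accuracy_by_group(model_records)
    baseline = accuracy_by_group(baseline_records)
    return {group: model[group] - baseline[group] for group in model if group in baseline}

# Made-up predictions for two illustrative income groups.
model_records = [("low", True), ("low", True), ("low", False), ("low", True),
                 ("high", True), ("high", True)]
baseline_records = [("low", True), ("low", False), ("low", False), ("low", False),
                    ("high", True), ("high", True)]
print(accuracy_gain(model_records, baseline_records))
```

Reporting the gain per group, rather than a single overall number, is what lets a chart show where a model helps most.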

SEER also outperformed supervised ImageNet-trained models by 2 points when detecting multimodal (image + text) hate speech using the Hateful Memes dataset, which Meta AI created and shared in 2020.

Performing adversarial attacks to open-source SEER responsibly

Recent research has shown how some AI models can be vulnerable to membership inference attacks, where an adversary tries to discern whether a particular example was part of the data set used to train the model. In some cases, it has been shown that adversaries can even query a model and then reconstruct specific samples from its training data set. Because Meta AI Research focuses on open research and aims to drive innovation in AI, we prioritized identifying a privacy- and security-protective process to open source SEER, as well as to build a blueprint for responsibly open-sourcing similar models.

We set out to test whether it is possible to infer that a particular image was part of SEER’s training data by comparing the model’s loss on that image to the loss other models assign to the same image. Using the Privacy Linter, an open-source tool developed at Meta, we found the accuracy of the attack was only slightly better than a completely random guess (50-50). Specifically, the maximum attack accuracy is 50.02 percent, whereas a random attack would achieve 50 percent accuracy for equally sized train and held-out sets. Moreover, we computed the precision at various recall levels to make sure that no training images were exposed at low recall levels, which could happen if the samples with the highest scores all belonged to the train set. Here again, precision stays below 50.15 percent for all levels of recall (including the lowest ones).
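
As a rough illustration of how such an attack can be scored, the sketch below computes the two quantities mentioned above: the best accuracy over all decision thresholds and the precision at a given recall. It does not use the Privacy Linter API. The function names, the score definition (reference loss minus target loss), and the toy scores are assumptions made for this example.

```python
import math

def calibrated_score(target_loss, reference_loss):
    """Higher score means 'more likely a training member' (assumed scoring rule)."""
    return reference_loss - target_loss

def best_threshold_accuracy(train_scores, heldout_scores):
    """Best accuracy of the rule 'score >= t means member' over all thresholds t."""
    labeled = [(s, 1) for s in train_scores] + [(s, 0) for s in heldout_scores]
    thresholds = sorted({s for s, _ in labeled}) + [float("inf")]
    best = 0.0
    for t in thresholds:
        correct = sum(1 for s, is_member in labeled if (s >= t) == (is_member == 1))
        best = max(best, correct / len(labeled))
    return best

def precision_at_recall(train_scores, heldout_scores, recall):
    """Precision when flagging top-scoring samples until `recall` of members are found.

    Ties are ordered with training samples first, which favors the attacker.
    """
    labeled = [(s, 1) for s in train_scores] + [(s, 0) for s in heldout_scores]
    labeled.sort(key=lambda pair: pair[0], reverse=True)
    needed = max(1, math.ceil(recall * len(train_scores)))
    found = 0
    for flagged, (_, is_member) in enumerate(labeled, start=1):
        found += is_member
        if found >= needed:
            return found / flagged
    return 0.0

# Toy scores for equally sized train and held-out sets (values are made up).
train_scores = [0.9, 0.8, 0.3, 0.2]
heldout_scores = [0.7, 0.6, 0.5, 0.1]
print(best_threshold_accuracy(train_scores, heldout_scores))
print(precision_at_recall(train_scores, heldout_scores, recall=0.5))
```

With equally sized sets, an attack that carries no signal gives values near 0.5 for both measures. The toy scores here are deliberately leaky, so the attack scores well above chance, which is the situation the low-recall precision check is meant to catch.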

Overall, currently published state-of-the-art attacks are unable to extract membership information from the SEER 10B model trained on 1B images. In addition to publishing the weights of the pretrained SEER 10B model, we are providing model documentation detailing how SEER was created and its intended uses. We believe that the details of the model will help AI practitioners understand the model’s performance when they use it in downstream tasks.

Using self-supervision to advance AI research

There is so much richness and variety in the world, but only a tiny fraction is contained in labeled data sets. The original SEER model showed that self-supervised learning can leverage random, unannotated images to deliver state-of-the-art performance. Now, by scaling to 10 billion parameters, SEER is more robust, more private, and more fair.

Because SEER does not rely on labeled data sets, we were able to train the model on a set of examples that is much more geographically diverse than ImageNet.

We hope to advance SEER further, improving its overall performance. Meta uses self-supervised learning extensively in our production systems today and we are exploring ways to use our work with SEER to improve our existing products and services and to create entirely new ones.

In particular, advancing computer vision is an important part of building the Metaverse. For example, to build AR glasses that can guide you to your misplaced keys or show you how to make a favorite recipe, we will need machines that understand the visual world as people do. They will need to work well in kitchens not just in Kansas and Kyoto but also in Kuala Lumpur, Kinshasa, and myriad other places around the world. This means recognizing all the different variations of everyday objects like house keys or stoves or spices. SEER breaks new ground in achieving this robust performance.

We are excited to forge ahead with SEER and self-supervised learning in other domains. Ultimately, we hope to create AI systems that understand the world holistically, across modalities such as images, text, speech, and even touch. These intelligent machines will unlock the metaverse and help people perform tasks in work and in everyday life. And by sharing our work with SEER here, we hope other researchers and engineers will accelerate progress in this important field.

Read the paper: Vision models are more robust and fair when pretrained on uncurated images without supervision

Implementation details and code

Model weights and model license

Model technical documentation

Written By

Priya Goyal

Software Engineer

Piotr Bojanowski

Research Science Manager

Polina Zvyagina

Policy Manager

Pierre Stock

Research Scientist