April 8, 2022
Today, we are releasing the first-ever external demo based on Meta AI's self-supervised learning work. We focus on Vision Transformers pretrained with DINO, a method we released last year that has grown in popularity thanks to its ability to understand the semantic layout of an image.
We chose to focus the first demo on DINO because of its ability to learn powerful, general-purpose semantic features that support tasks such as patch-level matching and retrieval. Using the demo, people will be able to experience these advancements firsthand, including finding similar images or pieces of similar images, such as matching the eyes of a puppy to find similar-looking dogs, regardless of where those eyes appear in an image or how it is lit.
While this may sound like a trivial use case, the technology underpinning this demo is part of the bigger picture we are building toward at Meta AI. Computer vision powered by self-supervised learning is an important part of helping Meta AI researchers deliver AI systems that are more robust and less domain-specific.
DINO enables AI researchers to build highly efficient computer vision systems that perform extremely well across a variety of tasks and are far less dependent on labeled data sets. For this to work, large-scale self-supervised training for computer vision needs an algorithm that can learn from random, unlabeled images and videos, along with enough data to capture the diversity of everyday life. Our new AI Research SuperCluster will allow us to explore training larger models on even larger data sets, pushing the boundaries of what self-supervised learning can achieve.
While we previously released the DINO code, this demo allows researchers and engineers to explore how the model understands images, to test its robustness, and to try it on their own images. And it allows others who are interested in new AI techniques to see how a single technique can create models that are generic enough to solve many tasks.
There are several experiences people can explore in the demo. Through image retrieval, a person could select a picture and discover similar images from a third-party data set of five million images. Patch-level retrieval lets people select an object or area from an image to discover similar images, such as the dog eyes we mentioned earlier. Finally, patch-matching can find similar areas between two given images, despite differences in the background, positioning of objects, and lighting.
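To make the retrieval experience concrete, here is a minimal sketch of ranking images by similarity with a pretrained DINO model. It assumes the publicly released ViT-S/16 checkpoint available through torch.hub; the image file names are placeholders, and the demo itself searches a far larger third-party collection.

```python
# Minimal sketch: image-level retrieval with a pretrained DINO ViT-S/16.
# Assumes the checkpoint released with the DINO code on torch.hub;
# file names below are placeholders, not part of the demo.
import torch
import torchvision.transforms as T
from PIL import Image

model = torch.hub.load("facebookresearch/dino:main", "dino_vits16").eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
])

def embed(path):
    """One feature vector per image (the ViT's class-token output)."""
    img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        feat = model(img)                          # shape [1, dim]
    return torch.nn.functional.normalize(feat, dim=-1)

query = embed("cat.jpg")
candidates = {name: embed(name) for name in ["other_cat.jpg", "dog.jpg", "car.jpg"]}

# Higher cosine similarity means "closer"; a cat should land nearest another cat.
for name, feat in candidates.items():
    print(name, (query @ feat.T).item())
```

In a real retrieval system the candidate embeddings would be precomputed once and indexed, so a query only needs a single forward pass followed by a nearest-neighbor lookup.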
When a person opens the demo and inputs an image or defines a patch of an image, DINO outputs a set of features, a numerical description of the image that can be used to measure how similar it is to other images. These features are useful because they let us compute the distance between two images, in the same way we can compute distances between 3D points described by three numbers. (For example, an image of a cat is “far away” from the image of a car but close to the image of a dog, and even closer to the image of another cat.) It’s this distance property that powers the DINO demo and delivers results, whether retrieving the nearest image or using patch-matching to show the closest patch.
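The same distance idea applies at the patch level. The sketch below, which reuses the checkpoint from the previous example, compares every patch of one image with every patch of another; it relies on the get_intermediate_layers helper exposed by the released DINO ViT implementation, and the file names are again placeholders.

```python
# Sketch of patch-level matching between two images with DINO patch features.
import torch
import torchvision.transforms as T
from PIL import Image

model = torch.hub.load("facebookresearch/dino:main", "dino_vits16").eval()
preprocess = T.Compose([
    T.Resize((224, 224)), T.ToTensor(),
    T.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
])

def patch_features(path):
    """L2-normalized per-patch features, shape [num_patches, dim]."""
    img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        # get_intermediate_layers returns [batch, 1 + num_patches, dim];
        # drop the class token to keep only the 14x14 grid of patch tokens.
        tokens = model.get_intermediate_layers(img, n=1)[0][0, 1:]
    return torch.nn.functional.normalize(tokens, dim=-1)

a = patch_features("puppy_eyes_crop.jpg")   # placeholder file names
b = patch_features("other_dog.jpg")

sim = a @ b.T               # cosine similarity, every patch of A vs. every patch of B
best = sim.argmax(dim=1)    # for each patch in A, the closest patch in B
print(best[:5], sim.max().item())
```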
DINO provides a training procedure that enables an untrained model to learn this property without using any labeled data. It’s based on a simple intuition: Given an image, we apply several modifications and teach our model that the modified image should still be similar to the original. These modifications include changing the brightness or contrast, cropping a smaller part of the image, or rotating it. Each modification teaches the model something: from rotation, it learns that a bunny in different poses still represents the same thing, while the brightness change teaches it that a bunny in the shadow is similar to a bunny in bright sunlight.
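The full DINO recipe pairs a student network with a momentum-averaged teacher and adds multi-crop augmentation, output centering, and temperature sharpening. The heavily simplified sketch below, using a small stand-in backbone and illustrative hyperparameters rather than the published ones, only shows the core idea described above: two randomly modified views of an image should produce matching outputs.

```python
# Highly simplified sketch of the training idea described above. The real DINO
# recipe adds multi-crop augmentation, output centering, and careful schedules;
# without those, a setup like this can collapse. Everything here is illustrative.
import copy
import torch
import torch.nn.functional as F
import torchvision.transforms as T
from torchvision.models import resnet18  # small stand-in backbone for brevity

augment = T.Compose([                              # the "modifications" from the text
    T.RandomResizedCrop(224, scale=(0.4, 1.0)),    # crop a smaller part of the image
    T.RandomHorizontalFlip(),
    T.ColorJitter(brightness=0.4, contrast=0.4),   # brightness / contrast changes
    T.ToTensor(),
])

student = resnet18(num_classes=256)     # backbone + output head
teacher = copy.deepcopy(student)        # momentum copy, never backpropagated
for p in teacher.parameters():
    p.requires_grad_(False)

opt = torch.optim.AdamW(student.parameters(), lr=1e-4)

def train_step(pil_images, momentum=0.996, t_student=0.1, t_teacher=0.04):
    # Two random "views" of every unlabeled image in the batch.
    v1 = torch.stack([augment(im) for im in pil_images])
    v2 = torch.stack([augment(im) for im in pil_images])

    # Teacher targets: sharpened probabilities, computed without gradients.
    with torch.no_grad():
        targets = F.softmax(teacher(v1) / t_teacher, dim=-1)

    # The student must predict the teacher's output for the other view.
    log_preds = F.log_softmax(student(v2) / t_student, dim=-1)
    loss = -(targets * log_preds).sum(dim=-1).mean()

    opt.zero_grad()
    loss.backward()
    opt.step()

    # The teacher tracks the student as an exponential moving average.
    with torch.no_grad():
        for ps, pt in zip(student.parameters(), teacher.parameters()):
            pt.mul_(momentum).add_(ps, alpha=1 - momentum)
    return loss.item()
```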
While this model wasn't developed with metaverse applications in mind, there are potential future applications in personalized visual queries that remain entirely on a person’s device, which can help keep data more private. For example, you take a photo of an object to teach DINO “these are my car keys.” Later, when looking for your keys, you can ask, “Where are my car keys?” This type of application requires being able to memorize objects and find them in images, and this is something the DINO model does well.
Image duplication identification is another potential future use case. DINO-based models could help detect copies of a particular piece of harmful content, even when the image has been modified. We believe self-supervised learning advances will ultimately pave the way for a future where machine learning algorithms can be built on and stay on a person’s device, creating a more private and personalized future powered by AI assistants.
While we are only beginning to harness the potential of self-supervised learning, we believe it will be an important advancement as we help build the metaverse and new AR/VR experiences. Self-supervised learning helps us gain a deep understanding of real-world environments and how people experience them, a world too big and diverse to capture in labeled data sets. We'll need AI that can learn from everything it sees and hears, and that's only possible with self-supervised learning.
While DINO represents an advancement in self-supervised learning and has many exciting potential use cases, we want to make sure this demo is used responsibly as part of our commitment to open science and responsible AI. It is against the demo’s terms of use to upload photos of humans, and we include a detector to block human faces.
We invite everyone to try our demo. While self-supervised learning is still in its infancy, we are excited about the potential it holds for the future as we continue to work on more private and personalized AI projects.