AI technology is developing rapidly and being applied across the globe in a variety of industries and use cases—including in the domain of computer vision. From measuring new tree growth in deforested areas to identifying the parts of a cell, computer vision models have the potential to help advance important fields of work by enabling increased automation, which can yield considerable time and cost savings. But as with any new technology, there are risks involved and it’s important to balance the speed of innovation with responsible development practices.
We want to continue advancing AI systems while acknowledging and addressing potentially harmful effects of that technological progress on historically marginalized communities. Read on to learn how Meta continues to embrace open source to push the state of the art forward while taking steps to uncover and confront systemic injustices and help pave the way toward a more equitable future.
We’re excited to announce that DINOv2, a cutting-edge computer vision model trained through self-supervised learning to produce universal features, is now available under the Apache 2.0 license. We’re also releasing a collection of DINOv2-based dense prediction models for semantic image segmentation and monocular depth estimation, giving developers and researchers even greater flexibility to explore its capabilities on downstream tasks. An updated demo is available, allowing users to experience the full potential of DINOv2 and reproduce the qualitative matching results from our paper.
By transitioning to the Apache 2.0 license and sharing a broader set of readily usable models, we aim to foster further innovation and collaboration within the computer vision community, enabling the use of DINOv2 in a wide range of applications, from research to real-world solutions. We look forward to seeing how DINOv2 will continue to drive progress in the AI field.
While DINOv2-like computer vision models allow us to accomplish tasks like image classification and semantic segmentation at unprecedented scale, we have a responsibility to ensure that our AI systems are fair and equitable. But benchmarking for fairness in computer vision is notoriously hard to do. The risk of mislabeling is real, and the people who use these AI systems may have a better or worse experience based not on the complexity of the task itself, but rather on their demographics.
That’s why we’re also introducing FACET (FAirness in Computer Vision EvaluaTion), a new comprehensive benchmark for evaluating the fairness of computer vision models across classification, detection, instance segmentation, and visual grounding tasks. The dataset is made up of 32,000 images containing 50,000 people, labeled by expert human annotators for demographic attributes (e.g., perceived gender presentation, perceived age group), additional physical attributes (e.g., perceived skin tone, hairstyle) and person-related classes (e.g., basketball player, doctor). FACET also contains person, hair, and clothing labels for 69,000 masks from SA-1B.
While FACET is for research evaluation purposes only and cannot be used for training, we’re releasing the dataset and a dataset explorer with the intention that FACET can become a standard fairness evaluation benchmark for computer vision models and help researchers evaluate fairness and robustness across a more inclusive set of demographic attributes.
We recognize that this is a novel approach to evaluating fairness in computer vision models because it accounts for person-related classes in addition to demographic and physical attributes, which allows deeper evaluation. That’s why we’re sharing this work with the broader AI community. We look forward to seeing how AI researchers and practitioners in the field use FACET to evaluate their computer vision models—and how that might inform the design and development of such models in the future.
For every image in FACET, we hired expert reviewers to manually annotate person-related demographic attributes like perceived gender presentation and perceived age group as well as correlating visual features like perceived skin tone, hair type, and accessories. In addition, the annotators defined bounding boxes for the people in the image and labeled fine-grained classes related to occupations and activities, such as doctor, disc jockey, or guitarist.
FACET helps us answer questions like:
FACET can be used to probe classification, detection, instance segmentation, and visual grounding models across individual and intersectional demographic attributes in order to develop a concrete, quantitative understanding of potential fairness concerns with computer vision models. In preliminary studies using FACET, we found that state-of-the-art models tend to exhibit performance disparities across demographic groups. For example, they may struggle to detect people whose skin tone is darker, and that challenge can be exacerbated for people with coily rather than straight hair. By releasing FACET, our goal is to enable researchers and practitioners to perform similar benchmarking to better understand the disparities present in their own models and monitor the impact of mitigations put in place to address fairness concerns. We encourage researchers to use FACET to benchmark fairness across other vision and multimodal tasks.
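To make the idea of probing individual versus intersectional attributes concrete, here is a minimal Python sketch. It assumes each annotated person is reduced to a record with a boolean detection outcome; the field names (`skin_tone`, `hair_type`, `detected`) and the records themselves are hypothetical and synthetic, not FACET's actual schema or results.

```python
from collections import defaultdict

def recall_by_group(records, keys):
    """Return detection recall for each combination of the given attribute keys."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for record in records:
        # A group is the tuple of attribute values, e.g. ("darker", "coily").
        group = tuple(record[key] for key in keys)
        totals[group] += 1
        hits[group] += int(record["detected"])
    return {group: hits[group] / totals[group] for group in totals}

# Synthetic records for illustration only (hypothetical field names and values).
records = [
    {"skin_tone": "lighter", "hair_type": "straight", "detected": True},
    {"skin_tone": "lighter", "hair_type": "straight", "detected": True},
    {"skin_tone": "lighter", "hair_type": "coily", "detected": True},
    {"skin_tone": "lighter", "hair_type": "coily", "detected": False},
    {"skin_tone": "darker", "hair_type": "straight", "detected": True},
    {"skin_tone": "darker", "hair_type": "straight", "detected": False},
    {"skin_tone": "darker", "hair_type": "coily", "detected": False},
    {"skin_tone": "darker", "hair_type": "coily", "detected": False},
]

# Individual attribute: one recall value per skin tone.
single = recall_by_group(records, ["skin_tone"])

# Intersectional: one recall value per (skin tone, hair type) pair.
intersectional = recall_by_group(records, ["skin_tone", "hair_type"])
```

Grouping on a tuple of attributes means the same function covers both cases: passing one key gives the per-attribute view, and passing several exposes gaps that an aggregate over a single attribute can hide.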
Evaluating DINOv2 with FACET
We previously evaluated our DINOv2 ViT-g backbone using established protocols for geographical fairness and potentially harmful label associations and compared it against the SEERv2 model trained on uncurated Instagram data. While this study didn’t reveal better model performance for any particular group, it left room for a more thorough analysis, which made it the perfect opportunity to take FACET for a test drive.
Using FACET, we can review the performance disparities of DINOv2's ViT-g backbone across the perceived gender presentation, age group, and skin tone attributes of a person and compare them against other visual backbones. The SEERv2 model mentioned above and the most comparable OpenCLIP visual encoder are also included and kept frozen during evaluation to match the DINOv2 paper.
This high-level summary aggregates the absolute performance disparities between extreme categories across all classes in FACET. It shows that DINOv2 performs similarly to these other models: While it fares a bit worse than OpenCLIP with respect to disparity across perceived gender presentation, it performs better than other models with respect to perceived age group and skin tone.
The FACET evaluation lets us dive deeper into the potential biases of the model at the level of classes, which is where FACET adds great value compared to earlier fairness evaluations. For example, for one of its most gender-biased classes, the “nurse” class, the DINOv2 model displays a disparity of +16.9 points between predictions on images of people who are perceived as having more stereotypically female attributes and those of people perceived as having more stereotypically male attributes. SEERv2 and OpenCLIP exhibit a more pronounced bias for the “nurse” class, with disparities of +21.4 and +23.9 points, respectively. As SEERv2 was pretrained on uncurated social media content, this might reflect a lack of diversity in the data source. And OpenCLIP, which uses web-crawled data filtered via the CLIP vision-language model, could amplify occupational gender associations already present in the image and text training data as well as in the filtering model itself.
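The per-class disparity and the aggregate summary discussed above can be sketched as follows. This is an illustration under stated assumptions: the text does not name the underlying metric or the aggregation rule, so the sketch assumes per-class recall in percentage points and a mean of absolute gaps. The group labels and numbers are hypothetical and are not FACET results.

```python
def class_disparity(recall, class_name, group_a, group_b):
    """Signed gap, in points, between two groups for one class."""
    return recall[class_name][group_a] - recall[class_name][group_b]

def mean_abs_disparity(recall, group_a, group_b):
    """Average of the absolute per-class gaps (assumed aggregation rule)."""
    gaps = [abs(per_group[group_a] - per_group[group_b])
            for per_group in recall.values()]
    return sum(gaps) / len(gaps)

# Hypothetical per-class recall in percentage points, for illustration only.
recall = {
    "nurse": {"group_a": 80.0, "group_b": 60.0},
    "guitarist": {"group_a": 70.0, "group_b": 75.0},
}

# Signed per-class gap: positive means group_a is favored for this class.
nurse_gap = class_disparity(recall, "nurse", "group_a", "group_b")

# Aggregate over all classes; absolute values keep opposite-signed gaps
# from cancelling each other out.
summary = mean_abs_disparity(recall, "group_a", "group_b")
```

The signed per-class value is what a figure like "+16.9 points" for a single class would correspond to under these assumptions, while the absolute-value aggregate is one plausible way to condense many classes into a single summary number.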
The preparation of DINOv2’s pre-training dataset may have inadvertently replicated the biases of the reference datasets selected for curation. Specifically, ImageNet, a dataset whose image distribution underrepresents certain groups, was used as a major reference and thus weighs significantly into the source of biases. We plan to address these potential shortcomings in future work and believe that image-based curation could also help avoid the perpetuation of potential biases arising from the use of search engines or text supervision.
The importance of open source
The first iteration of DINO was open sourced in 2021. Within a year, the community built on this work with the iBOT method, which greatly improved the performance and stability of the DINO family of models. Because iBOT was open sourced in turn, we were able to learn from its authors’ insights, improve our setup, and make further progress with DINOv2. And with the help of the open source community, DINOv2’s backbones were quickly made available in Hugging Face’s PyTorch Image Models (timm) library post-release, which means that swapping in DINOv2 backbones should be simple for all applications built on top of timm.
We’re grateful to the open source research community for contributions that were instrumental in the design and success of DINOv2. By re-releasing DINOv2 under a more permissive commercial license, we hope the community will continue experimenting responsibly, gleaning new insights, and spurring even further progress in the future.
Open source research is more than an ideal: it’s an efficient and mutually beneficial way to work, both for our research teams and the broader AI community. Our commitment to open source and the acceleration of progress in AI research is evidenced by the fact that we released DINOv2 early as a CC-BY-NC research preview while we continued working toward a more permissive license. We hope this new release makes it easier for enthusiasts, specialists, and professionals to invest their time in exploring DINOv2 for new use cases, finding vulnerabilities, and raising them with Meta directly. We’ll continue improving this family of models with the insights and feedback that the community provides, for the benefit of everyone.