Self-supervised learning (SSL), the concept that AI models can learn independently without human supervision, has emerged as the dominant paradigm in modern machine learning. It has driven the rise of large language models that acquire universal representations by pre-training on massive text corpora. However, progress in computer vision has lagged behind, as the most powerful image encoding models still rely heavily on human-generated metadata, such as web captions, for training.
Today, we’re releasing DINOv3, a generalist, state-of-the-art computer vision model trained with SSL that produces superior high-resolution visual features. For the first time, a single frozen vision backbone outperforms specialized solutions on multiple long-standing dense prediction tasks including object detection and semantic segmentation.
DINOv3’s breakthrough performance is driven by innovative SSL techniques that eliminate the need for labeled data—drastically reducing the time and resources required for training and enabling us to scale training data to 1.7B images and model size to 7B parameters. This label-free approach enables applications where annotations are scarce, costly, or impossible. For example, our research shows that DINOv3 backbones pre-trained on satellite imagery achieve exceptional performance on downstream tasks such as canopy height estimation.
We believe DINOv3 will help accelerate existing use cases and also unlock new ones, leading to advancements in industries such as healthcare, environmental monitoring, autonomous vehicles, retail, and manufacturing—enabling more accurate and efficient visual understanding at scale.
We’re releasing DINOv3 with a comprehensive suite of open sourced backbones under a commercial license, including a satellite backbone trained on MAXAR imagery. We’re also sharing a subset of our downstream evaluation heads, enabling the community to reproduce our results and build upon them. Additionally, we’re providing sample notebooks so the community has detailed documentation to help them start building with DINOv3 today.
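As a quick illustration of what getting started looks like, here is a minimal sketch of loading a released backbone for frozen-feature inference through torch.hub. The repository path and entrypoint name are assumptions for illustration only; the released notebooks document the exact model names and how the pretrained weights are obtained.

```python
import torch

# Minimal sketch of loading a released DINOv3 backbone for frozen-feature
# inference via torch.hub. The repo path and entrypoint name below are
# assumptions for illustration; see the released notebooks for the exact
# names and for how the pretrained weights are obtained.
backbone = torch.hub.load("facebookresearch/dinov3", "dinov3_vitl16")  # assumed names
backbone.eval()                                    # keep the backbone frozen

images = torch.randn(1, 3, 224, 224)               # placeholder RGB batch
with torch.no_grad():
    embedding = backbone(images)                   # global image representation
print(embedding.shape)
```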
Unlocking high-impact applications with self-supervised learning
DINOv3 achieves a new milestone by demonstrating, for the first time, that SSL models can outperform their weakly supervised counterparts across a wide range of tasks. While previous DINO models already held a significant lead in dense prediction tasks such as segmentation and monocular depth estimation, DINOv3 pushes these results further. Our models match or exceed the performance of the strongest recent models such as SigLIP 2 and Perception Encoder on many image classification benchmarks, and at the same time, they drastically widen the performance gap for dense prediction tasks.

DINOv3 builds on the breakthrough DINO algorithm, requiring no metadata input, consuming only a fraction of the training compute compared to prior methods, and still delivering exceptionally strong vision foundation models. The novel refinements introduced in DINOv3 lead to state-of-the-art performance on competitive downstream tasks such as object detection under the severe constraint of frozen weights. This eliminates the need for researchers and developers to fine-tune the model for specific tasks, enabling broader and more efficient application.
Finally, because the DINO approach is not specifically tailored to any image modality, the same algorithm can be applied beyond web imagery to other domains where labeling is prohibitively difficult or expensive. DINOv2 already leverages vast amounts of unlabeled data to support diagnostic and research efforts in histology, endoscopy, and medical imaging. In satellite and aerial imagery, the overwhelming volume and complexity of data make manual labeling impractical. With DINOv3, these rich datasets can be used to train a single backbone that then works across satellite types, enabling general applications in environmental monitoring, urban planning, and disaster response.
DINOv3 is already having real-world impact. The World Resources Institute (WRI) is using our latest model to monitor deforestation and support restoration, helping local groups protect vulnerable ecosystems. WRI uses DINOv3 to analyze satellite images and detect tree loss and land-use changes in affected ecosystems. The accuracy gains from DINOv3 support automating climate finance payments by verifying restoration outcomes, reducing transaction costs, and accelerating funding to small, local groups. For example, compared to DINOv2, DINOv3 trained on satellite and aerial imagery reduces the average error in measuring tree canopy height in a region of Kenya from 4.1 meters to 1.2 meters. WRI is now able to scale support for thousands of farmers and conservation projects more efficiently.
Scalable and efficient visual modeling without fine-tuning
We built DINOv3 by training a 7x larger model on a 12x larger dataset than its predecessor, DINOv2. To showcase the model’s versatility, we evaluate it across 15 diverse visual tasks and more than 60 benchmarks. The DINOv3 backbone particularly shines on all dense prediction tasks, showing an exceptional understanding of the scene layout and underlying physics.
The rich, dense features capture measurable attributes or characteristics of each pixel in an image and are represented as vectors of floating-point numbers. These features are capable of parsing objects into finer parts, even generalizing across instances and categories. This dense representation power makes it easy to train lightweight adapters with minimal annotations on top of DINOv3, meaning a few annotations and a linear model are sufficient to obtain robust dense predictions. Pushing things further and using a more sophisticated decoder, we show that it’s possible to achieve state-of-the-art performance on long-standing core computer vision tasks without fine-tuning the backbone. We show such results on object detection, semantic segmentation, and relative depth estimation.
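To make the adapter idea concrete, the following is a minimal sketch of training a linear segmentation head on top of frozen per-patch features. The feature shape, class count, and the way patch features are obtained are stand-in assumptions; the point is that only the small linear head receives gradients while the backbone stays untouched.

```python
import torch
import torch.nn as nn

# Minimal sketch of a lightweight linear adapter for dense prediction on top
# of frozen DINOv3 patch features. The feature shape ([batch, patches, dim]),
# the class count, and how patch features are obtained are stand-in
# assumptions; only the small linear head is trained.

class LinearSegHead(nn.Module):
    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: [B, N, D] -> per-patch class logits [B, N, num_classes]
        return self.classifier(patch_feats)

dim, num_classes = 1024, 21
head = LinearSegHead(dim, num_classes)
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)  # only the head is trained

# Stand-in for one batch of frozen backbone patch features (B=2, 16x16 patches).
with torch.no_grad():                       # the backbone runs without gradients
    patch_feats = torch.randn(2, 256, dim)  # in practice: the backbone's per-patch outputs
labels = torch.randint(0, num_classes, (2, 256))  # per-patch ground-truth labels

logits = head(patch_feats)
loss = nn.functional.cross_entropy(logits.reshape(-1, num_classes), labels.reshape(-1))
loss.backward()
optimizer.step()
```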
Because state-of-the-art results can be achieved without fine-tuning the backbone, a single forward pass can serve multiple applications simultaneously. This enables the inference cost of the backbone to be shared across tasks, which is especially critical for edge applications that often require running many predictions at once. DINOv3’s versatility and efficiency make it the perfect candidate for such deployment scenarios, as demonstrated by NASA’s Jet Propulsion Laboratory (JPL), which is already using DINOv2 to build exploration robots for Mars, enabling multiple vision tasks with minimal compute.
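Here is a minimal sketch of this amortization pattern, with a stand-in backbone and illustrative heads rather than the released evaluation heads: the frozen backbone runs once per image, and every task head reuses the resulting features.

```python
import torch
import torch.nn as nn

# Sketch of amortizing one frozen-backbone forward pass across several task
# heads. The backbone here is a stand-in stub, and the heads are illustrative
# assumptions rather than the released evaluation heads.

class FrozenBackboneStub(nn.Module):
    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # Stand-in for DINOv3 dense features: [batch, patches, feature dim].
        return torch.randn(images.shape[0], 256, 1024)

backbone = FrozenBackboneStub().eval()
depth_head = nn.Linear(1024, 1)    # per-patch relative depth
seg_head = nn.Linear(1024, 21)     # per-patch segmentation logits

images = torch.randn(4, 3, 224, 224)
with torch.no_grad():
    feats = backbone(images)       # the expensive forward pass runs once...
depth = depth_head(feats)          # ...and every task head reuses its output
segmentation = seg_head(feats)
```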
Scaling DINOv3 to 7B parameters shows SSL’s full potential. However, a 7B model is impractical for many downstream applications. Following feedback from the community, we built a family of models spanning a large range of inference compute requirements to empower researchers and developers across diverse use cases. We distill the ViT-7B model into smaller, high-performing variants such as ViT-B and ViT-L that outperform comparable CLIP-based models across a broad evaluation suite. Additionally, we introduce alternative ConvNeXt architectures (T, S, B, L), also distilled from ViT-7B, that accommodate varying compute constraints. We’re also releasing our distillation pipeline to enable the community to build upon this foundation.
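To illustrate the core idea behind such a pipeline, here is a generic feature-distillation sketch in which a small student is trained to match the representations of a large frozen teacher. The stand-in modules, dimensions, and cosine objective are assumptions for illustration and do not reflect the released pipeline.

```python
import torch
import torch.nn as nn

# Generic feature-distillation sketch: a small student is trained to match the
# representations of a large frozen teacher. All dimensions, the stand-in
# modules, and the cosine objective are assumptions for illustration and do
# not reflect the released pipeline.

in_dim, teacher_dim, student_dim = 384, 4096, 768
teacher = nn.Linear(in_dim, teacher_dim).eval()               # stand-in for the frozen ViT-7B teacher
student = nn.Sequential(nn.Linear(in_dim, student_dim),
                        nn.Linear(student_dim, teacher_dim))  # stand-in for a smaller variant
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4)

inputs = torch.randn(8, in_dim)                               # placeholder inputs
with torch.no_grad():                                         # the teacher provides fixed targets
    target = teacher(inputs)

pred = student(inputs)
loss = 1 - nn.functional.cosine_similarity(pred, target, dim=-1).mean()  # align student with teacher
loss.backward()
optimizer.step()
```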
Get started with our pre-trained models, code, and community resources
Over the last four years, we’ve seen the impact of DINO and DINOv2 across industries, and we’re excited to continue that momentum with DINOv3. Our early DINOv3 partners are already sharing impressive results, and we’re excited to see the meaningful new technologies the open source community will develop with our most capable model yet. As always, we’ll work closely with our partners, listen to feedback, and continuously iterate—making our models better for everyone.
Download the DINOv3 artifacts: