August 1, 2022
Meta AI is sharing new research on using Vision Transformers (ViTs) for object detection. Our approach, ViTDet, outperforms previous ViT-based models on the Large Vocabulary Instance Segmentation (LVIS) benchmark, a dataset that Meta AI researchers released in 2019 to facilitate research on low-shot object detection. In this task, the model must learn to recognize a much wider variety of objects than conventional computer vision systems can: LVIS includes not just standard items like tables and chairs, but also bird feeders, wreaths, doughnuts, and much more.
To enable the research community to reproduce and build upon these advancements, we are now releasing the ViTDet code and training recipes as new baselines in our open source Detectron2 object detection library.
Over the past year, ViTs have been established as a powerful backbone for visual recognition. Unlike typical convolutional neural networks, the original ViT is a plain, nonhierarchical architecture that maintains a single-scale feature map throughout its processing. This raises challenges when applying ViTs to object detection. For example, how can we detect multiscale objects effectively with a plain backbone? And is a ViT too inefficient to use for object detection in high-resolution images?
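To make the efficiency question concrete, here is a rough back-of-the-envelope sketch in Python. It counts patch tokens and the number of query-key pairs that global self-attention has to score. The patch size of 16 and the two image sizes (224 and 1024 pixels) are illustrative assumptions, not values stated in this post.

```python
def num_tokens(image_size: int, patch_size: int) -> int:
    """Number of patch tokens for a square image cut into square patches."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    side = image_size // patch_size
    return side * side


def global_attention_pairs(tokens: int) -> int:
    """Query-key pairs scored when every token attends to every token."""
    return tokens * tokens


PATCH_SIZE = 16  # assumed patch size, for illustration only

# 224 and 1024 are assumed example resolutions, not values from the post.
for image_size in (224, 1024):
    tokens = num_tokens(image_size, PATCH_SIZE)
    pairs = global_attention_pairs(tokens)
    print(f"{image_size}px image: {tokens} tokens, {pairs} attention pairs")
```

Under these assumptions, the 224-pixel image yields 196 tokens and 38,416 pairs, while the 1024-pixel image yields 4,096 tokens and 16,777,216 pairs. The pair count grows with the square of the token count, which is why global attention becomes costly at detection resolutions.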
Unlike existing approaches such as Swin and MViTv2, which rely on hierarchical backbones, ViTDet uses only plain, nonhierarchical ViT backbones. It builds a simple feature pyramid from the single-scale feature map output by the ViT and primarily uses simple, nonoverlapping window attention to extract features from high-resolution images efficiently. This design decouples the pretraining of the ViT from the fine-tuning demands of detection and thus enables the object detector to benefit from readily available pretrained masked autoencoder (MAE) models.
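The two ideas in this paragraph can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the released implementation. The 64 x 64 token grid, the window size of 16, and the rescaling factors (4, 2, 1, 0.5) are assumptions chosen for the example, and all function names are hypothetical. The sketch models only the windowed part of the attention and only the spatial sizes of the pyramid levels, not the layers that produce them.

```python
def window_partition(grid, window):
    """Split an H x W grid (list of rows) into nonoverlapping window x window tiles.

    Returns the tiles in row-major order, each flattened to a list of tokens.
    """
    height, width = len(grid), len(grid[0])
    if height % window or width % window:
        raise ValueError("grid sides must be divisible by the window size")
    tiles = []
    for top in range(0, height, window):
        for left in range(0, width, window):
            tiles.append([grid[r][c]
                          for r in range(top, top + window)
                          for c in range(left, left + window)])
    return tiles


def windowed_attention_pairs(height, width, window):
    """Query-key pairs when each token attends only within its own window."""
    tokens_per_window = window * window
    num_windows = (height // window) * (width // window)
    return num_windows * tokens_per_window ** 2


def pyramid_shapes(height, width, scale_factors=(4.0, 2.0, 1.0, 0.5)):
    """Spatial sizes of pyramid levels derived from ONE single-scale map.

    The scale factors are an assumption for illustration.
    """
    return [(int(height * s), int(width * s)) for s in scale_factors]


SIDE, WINDOW = 64, 16  # assumed token grid side and window size
grid = [[r * SIDE + c for c in range(SIDE)] for r in range(SIDE)]

tiles = window_partition(grid, WINDOW)
print(len(tiles), "windows of", len(tiles[0]), "tokens each")
print("global pairs:  ", (SIDE * SIDE) ** 2)
print("windowed pairs:", windowed_attention_pairs(SIDE, SIDE, WINDOW))
print("pyramid levels:", pyramid_shapes(SIDE, SIDE))
```

With these assumed numbers, restricting attention to windows cuts the scored pairs from 16,777,216 to 1,048,576, a 16-fold reduction. The pyramid levels all come from the same final feature map, so the backbone itself never needs a hierarchical design.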
We start by training ViTDet detectors following the Mask R-CNN framework with ViT backbones of base (B), large (L), and huge (H) sizes. We evaluate two pretraining strategies: supervised pretraining and self-supervised MAE pretraining. (Supervised pretrained ViT-H model weights are not available.) We measure accuracy on LVIS by the average precision of masks (Mask AP) and the average precision of masks on the rare categories (Mask AP-rare). Achieving good performance on rare categories is challenging because there are 10 or fewer training samples per rare category. We have two primary observations:
Compared with supervised pretraining, MAE pretraining delivers improved LVIS results as we scale ViTDet’s ViT backbone size.
We observe strong Mask AP gains for rare category detection, which is at the heart of the low-shot detection problem posed by LVIS.
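For readers unfamiliar with the Mask AP metric used above, the sketch below shows its basic building block: the intersection over union (IoU) between a predicted mask and a ground-truth mask. It is a simplified illustration, not the LVIS evaluation code. The threshold list (0.50 to 0.95 in steps of 0.05) follows the common COCO-style convention and is an assumption here, since this post does not spell it out. The function names are hypothetical.

```python
def mask_iou(pred, truth):
    """IoU of two binary masks given as equally sized lists of rows of 0/1."""
    if len(pred) != len(truth) or any(len(p) != len(t) for p, t in zip(pred, truth)):
        raise ValueError("masks must have the same shape")
    intersection = sum(p & t for pr, tr in zip(pred, truth) for p, t in zip(pr, tr))
    union = sum(p | t for pr, tr in zip(pred, truth) for p, t in zip(pr, tr))
    return intersection / union if union else 0.0


# Assumed COCO-style IoU thresholds: 0.50, 0.55, ..., 0.95.
IOU_THRESHOLDS = [round(0.50 + 0.05 * i, 2) for i in range(10)]


def thresholds_matched(iou):
    """How many of the assumed thresholds this prediction would satisfy."""
    return sum(iou >= t for t in IOU_THRESHOLDS)


predicted = [[1, 1, 0],
             [1, 1, 0],
             [0, 0, 0]]
ground_truth = [[1, 1, 0],
                [1, 0, 0],
                [0, 0, 0]]

iou = mask_iou(predicted, ground_truth)
print("IoU:", iou, "thresholds matched:", thresholds_matched(iou))
```

A full average precision computation also ranks predictions by confidence and averages precision over recall levels and categories. That part is omitted here to keep the example short.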
We also benchmark Mask R-CNN using other recently proposed hierarchical ViT backbones, including Swin and MViTv2. Swin and MViTv2 are pretrained with supervision on ImageNet-1K and ImageNet-21K. We search for optimal recipes separately for each backbone of base (B), large (L), and huge (H) sizes whenever available. Out of all the benchmarked backbones, ViTDet with MAE pretraining has the best scaling behavior and delivers the best performance on LVIS.
Object detection is an important computer vision (CV) task with applications ranging from autonomous driving to e-commerce to augmented reality. To make object detection more useful, CV systems need to recognize uncommon objects, including those that appear only rarely in their training data. With ViTDet, we now see a tipping point: LVIS, the benchmark dataset for the low-shot object detection challenge, benefits strongly from larger backbones and better pretraining. We hope that by open-sourcing our newly established strong baselines with ViTDet, we will help the research community further push the state of the art and build more effective CV systems.