November 22, 2021
The research team behind Meta AI’s Detectron project has recently been awarded the PAMI Mark Everingham Prize for contributions to the computer vision community. We first open-sourced the Detectron codebase five years ago as a collection of state-of-the-art algorithms for tasks such as object detection and segmentation. It has since evolved and advanced in important ways thanks to the contributions of both the open source community and many researchers here at Meta.
In 2019, we released a ground-up rewrite of the codebase entirely in PyTorch to make it faster, more modular, more flexible, and easier to use in both research-first and production-oriented projects. Earlier this year, we released Detectron2Go (D2Go), a state-of-the-art extension for training and deploying efficient object detection models on mobile devices and hardware. We also released significantly improved baselines based on recently published state-of-the-art results from other researchers in the field.
Several members of the Detectron team sat down to discuss the project’s origins, advances, and future.
How did the Detectron project come about?
Ross Girshick, Meta AI Research Scientist: Detectron was created to make our object detection research possible. I've always valued engineering and science equally, and I view them as inextricably linked. Scientific progress is enabled by the tools that engineering builds, and engineering progresses from newfound scientific knowledge. So, in some sense, I think the Detectron project started back in 2008, when I was a first-year PhD student and needed some tools to do research.
The code has been entirely rewritten multiple times over the years, and the set of algorithms it supports has evolved enormously. But I can still trace threads from Detectron2 all the way back to 2008. One example is what we call the config system, which is the part of Detectron that makes it possible to specify and run experiments. While this likely sounds less exciting than a fancy new deep learning algorithm, the config system has always been central in my mind because it enables rigorous, reproducible scientific exploration. This system has gone through many iterations, and I think the LazyConfig approach that we now have in Detectron2 is remarkably effective.
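To make the idea of a lazy config concrete, here is a minimal, self-contained sketch of deferred construction: the config is ordinary Python data that records which callable to invoke and with which arguments, so an experiment can override any field before anything is built. The names `LazyNode`, `lazy`, `build`, `Backbone`, and `Detector` are hypothetical and chosen for illustration only; this is a simplified model of the concept, not Detectron2's actual `LazyCall`/`instantiate` implementation.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict


@dataclass
class LazyNode:
    """Records a callable and its keyword arguments without calling it yet."""
    target: Callable[..., Any]
    kwargs: Dict[str, Any] = field(default_factory=dict)


def lazy(target: Callable[..., Any]) -> Callable[..., LazyNode]:
    """Hypothetical helper: lazy(f)(**kw) describes the call f(**kw) without running it."""
    def record(**kwargs: Any) -> LazyNode:
        return LazyNode(target, dict(kwargs))
    return record


def build(node: Any) -> Any:
    """Recursively construct every object described by a tree of LazyNodes."""
    if isinstance(node, LazyNode):
        kwargs = {k: build(v) for k, v in node.kwargs.items()}
        return node.target(**kwargs)
    if isinstance(node, dict):
        return {k: build(v) for k, v in node.items()}
    if isinstance(node, list):
        return [build(v) for v in node]
    return node  # plain values (ints, strings, ...) pass through unchanged


# Toy components standing in for a real backbone and detector (hypothetical).
@dataclass
class Backbone:
    depth: int


@dataclass
class Detector:
    backbone: Backbone
    num_classes: int


# The "config" is plain Python data: no model object exists yet.
config = {
    "model": lazy(Detector)(
        backbone=lazy(Backbone)(depth=50),
        num_classes=80,
    ),
    "train": {"max_iter": 1000},  # toy value for illustration
}

# An experiment variant overrides nested fields before anything is built.
config["model"].kwargs["backbone"].kwargs["depth"] = 101
config["model"].kwargs["num_classes"] = 10

model = build(config["model"])
print(model)  # Detector(backbone=Backbone(depth=101), num_classes=10)
```

The design choice this illustrates is that deferring construction keeps the full experiment description inspectable and editable as data, which is what makes variants easy to specify and reproduce.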
The two most recent versions of the project were both developed by Meta AI researchers over the last five years. I started the first version of the Detectron library shortly after arriving here because I needed better tools with good multi-GPU support in order to increase experimental throughput. So again, Detectron (v1) was developed for the pursuit of research. It enabled many projects from our group that I’m very proud of, like Feature Pyramid Networks, RetinaNet, and Mask R-CNN. Its current form, Detectron2, is a ground-up rewrite based on PyTorch that represents a huge step forward for the project.
Kaiming He, Meta AI Research Scientist: Object detection research is a bellwether in computer vision. While AlexNet in 2012 revolutionized deep learning, it is R-CNN — a deep learning object detection system — that arguably convinced the computer vision community to put aside its skepticism. On one hand, object detection systems have demonstrated the potential and capability of deep learning (or, say, differentiable programming) for solving complex problems beyond classification. On the other hand, object detection has been a key component for solving more intelligent tasks, such as image captioning, visual question answering, and reasoning. A reliable, sustainable, and reproducible object detection toolbox is foundational for the computer vision community to move forward.
Research in object detection is continuous: Later ideas are built on top of previous systems and baselines. We have witnessed the meta-algorithm evolving from R-CNN to Fast R-CNN to Faster R-CNN to Mask R-CNN. The concepts of anchors (region proposal network, RPN), feature pyramid networks (FPN), and focal loss (RetinaNet) have been central topics in object detection research in recent years. As this progress continues, a reference toolbox is essential for the community to implement, benchmark, and reproduce ideas.
Piotr Dollar, Meta AI Research Manager: Our group's progress in object detection was fueled by two elements. First, we did some really fun new research on object detection and segmentation algorithms (FPN, RetinaNet, Mask R-CNN, panoptic segmentation, etc.). Second, Detectron and Detectron2 really powered and unified our progress. Being able to push the boundaries of both research and engineering in the same group enabled us to move quickly, try new things, and iterate on ideas. And, just as we shared our ideas with the community through publications, we also shared models and frameworks for reproducing and building upon them via our code and model releases. Overall, our research and engineering efforts went hand-in-hand and contributed to the success of our work. In many ways, this template of coupling research and engineering powers many other efforts in our group (e.g., in 3D recognition, representation learning, video, etc.). So to answer the original question of how Detectron came about: it was really an integral part of how we did our research in object detection and segmentation!
What do you think was key to it becoming such a widely used library among AI researchers and engineers?
Alexander Kirillov, Meta AI Research Scientist: When developing the library, one of our main focuses was to make future explorations as easy as possible. For our group, Detectron2 is the starting point for a large share of research projects. So we tried to make sure the library does not create any overhead for adding new features and that one can quickly try out a new idea without needing to write a lot of supporting code. As we improved the architecture of Detectron2 and added new features, tasks, and data sets, we always tried to make sure that these changes did not restrict our ability to quickly test new ideas. In my opinion, this ease of trying new things is one of the key properties that attracted a lot of researchers to Detectron2. Another important factor has been our model zoo. The models there are implemented in a memory- and compute-efficient way, so they do not take up all the GPU memory and leave space for the development of new ideas.
Wan-Yen Lo, Meta AI Research Manager: The first generation of the Detectron library was implemented in Caffe2 and released in 2018. After gathering feedback from many researchers and practitioners, we rewrote the library from scratch in PyTorch and designed the second generation, Detectron2, to be modular, extensible, scalable, and efficient. Detectron2 allows researchers to explore novel ideas easily by reusing the optimized components in the library. Our team also collaborated with mobile vision researchers and engineers here to build a production layer, D2Go, on top of Detectron2 to make it easier to deploy advanced new models to production. Detectron2 and D2Go are used widely across Meta for conducting research (e.g., PointRend, Mesh R-CNN) and for powering product applications (e.g., Smart Camera in Portal). The code released on GitHub is the same as what’s used internally here. We share all our learnings and work openly, and I think that’s key to why the Detectron project is so popular in the community.
Ilija Radosavovic (now a PhD student at UC Berkeley): One aspect I would like to highlight is the completeness of the release. In particular, Detectron included training scripts, configuration files, and all the necessary details to reproduce a range of baselines with a single command. It also included an extensive model zoo with over 70 pretrained models. While some of these aspects have now become standard, they were certainly not common at the time. Overall, I believe that Detectron has become the default template for releasing open source computer vision projects.
How does Detectron demonstrate Meta AI’s open science approach to research, and how did that approach benefit the project?
Wan-Yen Lo: I've given several talks about our work, and one of the most common questions is, "Why are you and your colleagues so selfless in sharing your work? Don't you worry about competition?" The answer can actually be found in the mission statement of Facebook AI Research (FAIR): "Advance the state of the art in artificial intelligence through open research for the benefit of all." We believe that AI cannot be solved in a vacuum, and that progress is more efficient and more beneficial for everyone when we leverage help from the entire community. This openness is one factor that attracted many researchers, including me, to join FAIR, and we actually benefit from it as well. For example, we released Detectron2 to allow other researchers to develop and publish new work more quickly, and after they release their code based on our library, we can build on their work more easily as well.
Ross Girshick: The Detectron project grew out of previous open source efforts, and Meta AI has been incredibly supportive of paying that work forward by enabling the initial open source release of Detectron as well as its continued development through Detectron2. This support has been very important for the broader community because for a while there were very few open source detection systems available. (There are a few more options today.) The original Detectron library was really instrumental in enabling scientific progress in the field, particularly around 2018–2020. It established a high bar for reproducible research and making state-of-the-art research artifacts (models, training scripts, etc.) available for the benefit of all. In my view, it’s been a model project for others to follow.
What were some of the hard challenges or decisions with the project?
Yuxin Wu, Meta AI Research Engineer: Assumptions and fixed paradigms are inevitable in software development, but they tend to be broken by new innovations. Since Detectron2 is used by practitioners to innovate in computer vision research, designing it with future-proof flexibility is both very challenging and crucial to the project. We had to carefully design each component so that our abstractions are sufficient to help our users but not so heavy that they constrain them. We were happy to find that this flexibility allowed the project to be adopted in scenarios we didn't anticipate at the beginning.
Ilija Radosavovic: Detectron is a research platform whose primary goal is to support rapid implementation and evaluation of research ideas. However, it is also a large software system that should follow good software engineering practices and be easy to test, maintain, and extend. Striking a good balance between the two has often been quite challenging. Overall, I believe Detectron made reasonable trade-off choices, but this certainly remains a challenge for the field at large when it comes to developing large software systems for research.
How has the open source community contributed to Detectron?
Francisco Massa, Meta AI Research Engineer: Nearly 200 developers from around the world have contributed to the original Detectron library and Detectron2, with nearly a quarter of all Detectron2 pull requests coming from the open source community. The open source community has spotted (and fixed) many bugs that would otherwise have gone unnoticed. Indeed, more than 3,000 issues have been opened on GitHub, the majority of which have since been addressed, in many cases by the open source community itself. This feedback loop creates a healthy ecosystem for researchers and practitioners to learn and push the field of computer vision forward, and it was critical for the success of Detectron.
Yuxin Wu: In addition to contributions directly to the project, the community also helped the project by growing the ecosystem around it. We have started to see an increasing number of research papers and tools released based on Detectron, and we actively draw inspiration from them to improve our projects as well.
What’s surprised you most about how the project has evolved and how it’s being used?
Ross Girshick: When I started writing open source object detection code back in 2008, it was only used by a handful of PhD researcher types around the world. They used it mainly to do research in academia. This largely remained true until we released the original Detectron library in 2018. The users and use cases really exploded in a way that I did not anticipate. Suddenly, we were seeing all kinds of people using Detectron, from high school students to entrepreneurs to hobbyists tinkering with smart home systems. People were creating tutorials for it on YouTube and so on. Internally, at Meta, I was also surprised to see how rapidly the code was adopted for product use cases, like the Smart Camera in Portal. I probably should have anticipated this change, but I’ve always been very focused on the research front, and it caught me by surprise.
Yuxin Wu: Internally, what surprised me the most is the number of applications across our company. We knew that object detection is a core step in image understanding, but weren't fully aware of how much potential there is, given the volume of Meta products. For example, it has been used to detect text on Facebook, merchandise on Instagram, human poses on Reels, and keyboards in Workrooms. With our company's mission to build a metaverse, the perception and scene understanding capabilities offered by the project are going to be even more important, and I'm excited to see what's to come.
Ilija Radosavovic: People used Detectron in all sorts of creative ways, from developing new research projects to powering self-driving cars to counting animals on a farm. This was certainly both the most surprising and the most rewarding part.
Given the speed at which computer vision research has advanced, how has Detectron remained such a widely used tool?
Alexander Kirillov: The library is being developed by a group of active computer vision researchers. We build our own new projects based on Detectron2 and therefore have no option but to keep up with the changes. For example, Detectron2 supports the newest family of Transformer-based models for detection (DETR) and segmentation (MaskFormer). We also adopted longer training schedules and more aggressive data augmentation strategies, following the change in how the community evaluates the best new models. In addition, unlike many other libraries, with Detectron2 we decided to build a single system to support a set of localization tasks. We correctly guessed that our community was becoming increasingly interested in unified models and multitask setups, which Detectron2 supports natively. Stay tuned for new features. (=
Georgia Gkioxari, Meta AI Research Scientist: Detectron2 is the go-to library for researchers working in 2D object recognition. But its power extends well beyond 2D. Its modular design, and the ease with which one can switch parts of the network and design new backbones and heads, make it an extremely valuable toolkit for any image-based research project. The model zoo and the various model designs make Detectron2 a solid foundation for any project. You basically start with a state-of-the-art object recognition model. What more can you ask?! Detectron2 is the foundation for Mesh R-CNN and other 3D projects that involve object-centric understanding, regardless of the final task. It is written in PyTorch, so it blends easily with other libraries like PyTorch3D, which in turn opens the door for exciting, out-of-the-box, novel ideas, projects, and directions.
What’s next for the project?
Wan-Yen Lo: We will continue working with the community to add state-of-the-art research to Detectron2. For example, we just released significantly improved Mask R-CNN baselines. We are actively conducting research on object detection and segmentation for images, videos, and even 3D data. We recently obtained state-of-the-art results with Multiscale Vision Transformers, and we plan to release the models in Detectron2 soon. We have several exciting projects in progress, and we will share the results openly later. Please stay tuned!
Alexander Kirillov: We observe that different subfields of computer vision are getting closer to one another. For instance, thanks to Transformers, similar architectures are now used for different modalities, like images and video. More and more researchers are working on multimodal and multitask settings that transcend simple-but-limited setups of a single-modality data set with a single fixed task and evaluation protocol. Seeing this shift in our community, we strive to create a larger infrastructure of interoperable components across Detectron2, PyTorch3D, and PyTorchVideo, allowing users to build new, more holistic solutions for computer vision.
AI Blog Manager