August 12, 2021
We are sharing Unidentified Video Objects (UVO), a new benchmark to facilitate research on open-world segmentation, an important computer vision task that aims to detect, segment, and track all objects exhaustively in a video. While machines typically must learn specific object concepts in order to recognize them, UVO can help them mimic humans’ ability to detect unfamiliar visual objects.
Over the past few years, object segmentation has become one of the most active areas of research in computer vision. That’s because it’s key to correctly identifying the objects in a scene and understanding where they’re located. As a result, researchers have proposed a number of different approaches for segmenting objects in visual scenes, such as Mask R-CNN and MaskProp.
These state-of-the-art models work well under the closed-world assumption, which in computer vision assumes that any objects a model sees must belong to a predetermined list of object categories. Put another way, the model already knows which concepts it should detect and segment during training as well as deployment.
But in real-world applications, such as embodied AI or augmented reality assistants, there are countless object concepts that models have never seen or learned, namely those outside their predefined dictionary. Since it’s not feasible to train a model on all these open-world, unseen objects, models generally struggle to segment every object they may encounter. For example, if there are no trombones or shuttlecocks in a model’s training data, the model will do a poor job of identifying trombones and shuttlecocks in new imagery.
People, on the other hand, can detect unfamiliar objects, such as novel musical instruments or unknown sports equipment, even with no previous knowledge of them. Despite such unfamiliarity, people have no problem perceiving them as distinct object instances. Even cinematic examples like UFOs will be identified as independent objects. So an important research question is whether machines can also learn to segment objects without advance knowledge of their concepts. This question motivated us to explore open-world settings where machines would be tasked with detecting and segmenting any object they encounter, regardless of whether it is previously known or unknown.
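To make the open-world setting concrete, here is a minimal sketch of how class-agnostic segmentation could be scored: predicted masks are matched to annotated masks by overlap alone, and no category label is consulted. The function names, the set-of-pixels mask representation, the greedy matching, and the 0.5 IoU threshold are all illustrative assumptions, not the evaluation protocol used by UVO.

```python
from typing import FrozenSet, List, Tuple

Pixel = Tuple[int, int]   # (row, column)
Mask = FrozenSet[Pixel]   # assumed representation: the set of pixels an object covers


def mask_iou(a: Mask, b: Mask) -> float:
    """Intersection over union of two pixel sets (0.0 if both are empty)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def class_agnostic_recall(
    gt_masks: List[Mask], pred_masks: List[Mask], iou_threshold: float = 0.5
) -> float:
    """Fraction of annotated objects covered by some prediction.

    Matching is greedy and ignores categories entirely: each prediction
    can be matched to at most one annotated object.
    """
    if not gt_masks:
        return 0.0
    unused = set(range(len(pred_masks)))
    matched = 0
    for gt in gt_masks:
        best_idx, best_iou = None, 0.0
        for idx in sorted(unused):
            iou = mask_iou(gt, pred_masks[idx])
            if iou > best_iou:
                best_idx, best_iou = idx, iou
        if best_idx is not None and best_iou >= iou_threshold:
            unused.remove(best_idx)
            matched += 1
    return matched / len(gt_masks)


def rect(top: int, left: int, height: int, width: int) -> Mask:
    """Hypothetical helper that builds a rectangular mask for the demo."""
    return frozenset(
        (r, c)
        for r in range(top, top + height)
        for c in range(left, left + width)
    )


# Two annotated objects; the model finds one of them and hallucinates another.
ground_truth = [rect(0, 0, 4, 4), rect(10, 10, 2, 2)]
predictions = [rect(0, 0, 4, 3), rect(20, 20, 2, 2)]
recall = class_agnostic_recall(ground_truth, predictions)
```

Because nothing in the matching refers to a label, an object from a category the model never saw in training counts exactly as much as a familiar one.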
UVO contains real-world videos taken from Kinetics, the popular action recognition benchmark, with dense and exhaustive high-quality object mask annotations. The included video clips have an average of 13.5 unique object instances, eight times as many as in existing data sets built with the closed-world assumption. We believe UVO is a versatile test bed for researchers to develop novel approaches for open-world object segmentation, while inspiring new research that attempts to build a more comprehensive video understanding beyond classification and detection. This was not possible with the design of previous data sets and benchmarks.
Further, because annotating objects in videos is extremely resource-intensive, this type of work has not been done before. In the past, researchers have primarily focused on solving closed-world setups. Open-world problems are important, but they’re difficult and often ignored. Our hope is to bring attention to this area of research by providing the UVO data set.
The primary intuition behind our research is that humans can detect and localize novel, unfamiliar objects regardless of their categories.
We believe it’s possible to develop models capable of handling open-world settings. We demonstrate this with videos randomly selected from the Kinetics data set, which is composed of YouTube videos from a wide range of sources. Instead of predefining the objects to annotate, we used a crowdsourcing service to annotate all visible objects in the clips. As a result, the data set contains numerous challenging scenarios, including objects outside of classical taxonomies, fast-moving objects, motion blur, crowded scenes, and so on.
This work is now possible because we’ve employed a semiautomated pipeline, which helps human annotators by automatically propagating mask predictions from annotated frames to unannotated frames. Instead of annotating from scratch, annotators make corrections on precomputed masks.
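The shape of such a pipeline can be sketched as follows. This is only an illustration under stated assumptions: the post does not describe how masks are actually propagated, so the sketch substitutes a trivial placeholder that copies masks from the nearest annotated frame. The names `nearest_keyframe_proposals`, `annotate_clip`, and the `correct` callback (which stands in for the human annotator) are hypothetical.

```python
from typing import Callable, Dict, List

Masks = List[object]  # one entry per object; the mask type itself is left open


def nearest_keyframe_proposals(
    keyframes: Dict[int, Masks], num_frames: int
) -> Dict[int, Masks]:
    """Precompute masks for every unannotated frame.

    Placeholder propagation (an assumption, not the real pipeline): copy
    the masks of the nearest annotated frame, preferring the earlier
    frame when two are equally close.
    """
    if not keyframes:
        raise ValueError("at least one annotated frame is required")
    proposals: Dict[int, Masks] = {}
    for frame in range(num_frames):
        if frame in keyframes:
            continue
        nearest = min(keyframes, key=lambda k: (abs(k - frame), k))
        proposals[frame] = list(keyframes[nearest])
    return proposals


def annotate_clip(
    keyframes: Dict[int, Masks],
    num_frames: int,
    correct: Callable[[int, Masks], Masks],
) -> Dict[int, Masks]:
    """Combine human-annotated keyframes with corrected proposals."""
    annotations = dict(keyframes)
    for frame, proposal in nearest_keyframe_proposals(keyframes, num_frames).items():
        # The annotator edits a precomputed proposal instead of starting from scratch.
        annotations[frame] = correct(frame, proposal)
    return annotations


# Toy run: frames 0 and 4 are annotated by hand; strings stand in for masks.
keyframes = {0: ["mask_a"], 4: ["mask_b"]}
result = annotate_clip(
    keyframes,
    num_frames=5,
    correct=lambda frame, masks: masks + ["added_by_annotator"] if frame == 2 else masks,
)
```

The saving comes from the `correct` step: when a proposal is already close to right, the annotator's work shrinks to small fixes on a few frames.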
Teaching machines to detect any object — whether familiar or entirely novel — will enable them to perform a wide range of important tasks that are beyond the abilities of today’s AI. Object search, instance registration, human-object interaction modeling, and human activity understanding all require open-world prediction abilities, for example. The open-world setting is also natural for exciting new applications in fields like robotics, autonomous driving, or augmented reality assistants, which regularly present entirely novel situations.
Current video models can understand and make predictions on only very short clips (around 2 seconds). But open-world object segmentation provides opportunities for long video modeling, as well as for more complex prediction tasks, like learning about the relationships between objects in a video or image. It also makes it possible to identify and retrieve objects in long video clips, finding portions of videos that show a skateboard, for example. Grouping pixels into semantic entities (including unknown classes) will provide plausible alternatives to existing 3D CNNs, such as models that reason about objects and their interactions.
We would like to thank Abhijit Ogale, Mike Zheng Shou, Dhruv Mahajan, Kristen Grauman, Lorenzo Torresani, Manohar Paluri, Rakesh Ranjan, and Federico Perazzi for their valuable feedback on the data set; Jiabo Hu, Haoqi Fan, and William Wen for engineering support; Sally Yoo, Yasmine Babaei, and Eric Alamillo for supporting annotation logistics; and our annotators for their hard work.