October 7, 2021
Pretraining using large labeled data sets has become a core tool for developing high-performance computer vision (CV) models. But while this method works well with many types of media, it hasn’t been widely used for 3D recognition tasks, such as identifying and localizing a couch in a 3D scan of a living room.
This is due to a lack of annotated data and the time-consuming nature of labeling 3D data sets. Additionally, models for 3D understanding often rely on a handcrafted architecture design that is tightly coupled with the particular 3D data set used for training.
At ICCV 2021, we are presenting 3DETR and DepthContrast, two complementary new models that advance 3D understanding and make it significantly easier to get started. Our new models address these common challenges by establishing a general 3D architecture that simplifies 3D understanding, and through a self-supervised learning method that doesn’t require labels.
Facebook AI is now making this research and code available to the open source community.
Building machines that understand 3D data about the world is important for many reasons. Autonomous cars need 3D understanding to move and avoid obstacles, while AR/VR applications can help people with practical tasks, such as visualizing whether a couch would fit in a living room.
While data from 2D images and videos is represented as a regular grid of pixels, 3D data is represented as point coordinates. Because 3D data is harder to acquire and label, 3D data sets are typically much smaller than image and video data sets, both in overall size and in the number of classes or concepts they contain.
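To make the difference in representation concrete, here is a minimal sketch that back-projects a depth image (a regular pixel grid) into an unordered set of XYZ points. It assumes a simple pinhole camera; the function name and the intrinsics `fx`, `fy`, `cx`, `cy` are illustrative and do not come from the released code.

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project an (H, W) depth map into an (N, 3) array of XYZ points.

    Assumes a pinhole camera model; fx, fy, cx, cy are illustrative intrinsics.
    Pixels with zero depth are treated as missing and dropped.
    """
    h, w = depth.shape
    # v indexes rows (image y), u indexes columns (image x)
    v, u = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    valid = depth > 0
    z = depth[valid]
    x = (u[valid] - cx) * z / fx
    y = (v[valid] - cy) * z / fy
    return np.stack([x, y, z], axis=1)

# A tiny 4x6 depth image with one small patch 2 m from the camera.
depth = np.zeros((4, 6), dtype=np.float32)
depth[1:3, 2:5] = 2.0
points = depth_to_point_cloud(depth, fx=5.0, fy=5.0, cx=3.0, cy=2.0)
print(depth.shape, points.shape)  # grid of 4x6 pixels vs. a list of 6 points
```

The grid has a fixed shape regardless of content, while the point set only contains as many rows as there are valid measurements, which is one reason architectures built for images do not transfer directly.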
Previously, a practitioner focused on 3D understanding would need significant domain knowledge to adjust a standard CV architecture. Single-view 3D data (taken from one camera that also records depth information) is much easier to collect than multiview 3D, which leverages two or more cameras recording the same scene. Multiview 3D data is often generated by post-processing single-view 3D, but this processing step has a failure rate estimated by some researchers to be as high as 78 percent, for reasons such as blurry source images or excessive camera motion.
DepthContrast addresses these data challenges by training self-supervised models on any 3D data, whether single-view or multiview, which sidesteps the difficulty of working with small or unlabeled data sets. (Pretraining on even large quantities of 2D images or video is unlikely to yield accurate 3D understanding for sophisticated applications such as AR/VR.)
3DETR, our second work, is an abbreviation of 3D Detection Transformer. The model is a simple 3D detection and classification architecture based on Transformers, which can be used as a general 3D backbone for detection and classification tasks. Our model simplifies the loss functions used to train 3D detection models, which makes it much easier to implement.
It also equals or exceeds the performance of prior state-of-the-art methods that rely on hand-tuned 3D architectures and loss functions.
These models have massive potential to be used for everything from helping robots navigate the world to bringing rich new VR/AR experiences to people using their smartphones and future devices, such as AR glasses.
With 3D sensors now ubiquitous in phones, a researcher could even obtain single-view 3D data from their own device to train the model. The DepthContrast technique is a first step toward using this data in a self-supervised way. By working with both single- and multiview data types, DepthContrast greatly increases the potential use cases of 3D self-supervised learning.
3DETR takes a 3D scene — represented as a point cloud, or set of XYZ point coordinates — as input and produces a set of 3D bounding boxes for objects in the scene. This new research builds on VoteNet, our model to detect objects in 3D point clouds, and Detection Transformers (DETR), a simpler architecture created by Facebook AI for reframing the challenge of object detection.
To make the jump from 2D detection, which previous research from Facebook AI does well, we identified two important changes we needed to make for Transformers to work for 3D understanding: non-parametric query embeddings and Fourier encodings.
Both design decisions are needed because point clouds vary widely in density, mixing large regions of empty space with noisy points. 3DETR uses two techniques to deal with this. First, Fourier encodings are a better way to represent the XYZ coordinates than the standard (sinusoidal) embeddings used in DETR and other Transformer models.
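As a rough illustration of what a Fourier encoding of XYZ coordinates can look like, the sketch below projects each point onto a set of random Gaussian frequencies and takes the sine and cosine of the result. This is one common formulation and is an assumption here; the function name, the number of frequencies, and the scale are illustrative and not taken from 3DETR.

```python
import numpy as np

def fourier_encode(xyz, num_freqs=8, scale=1.0, seed=0):
    """Map (N, 3) coordinates to (N, 2 * num_freqs) Fourier features.

    Assumption: frequencies are drawn once from a Gaussian and kept fixed.
    """
    rng = np.random.default_rng(seed)
    freqs = rng.normal(scale=scale, size=(3, num_freqs))  # fixed random projection
    proj = 2.0 * np.pi * (xyz @ freqs)                    # (N, num_freqs)
    return np.concatenate([np.sin(proj), np.cos(proj)], axis=1)

xyz = np.array([[0.0, 0.0, 0.0],
                [0.5, -0.2, 1.0]])
enc = fourier_encode(xyz)
print(enc.shape)  # (2, 16)
```

Because the same seed yields the same frequencies, every point in a scene is encoded consistently, and nearby points receive similar features.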
Secondly, DETR uses a fixed set of parameters (called queries) to predict the location of the objects. We found that this design decision does not work for point clouds. Instead, we sample random points from the scene and predict objects relative to these points. In effect, we do not have a fixed set of parameters to predict locations, and our random point sampling adapts to the varying density of the 3D point cloud.
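The idea of non-parametric queries can be sketched as follows: instead of learning a fixed table of query vectors, query locations are drawn from the input scene itself. This minimal example uses uniform random sampling, following the description above; the released model may sample points differently, and the function name and toy scene are hypothetical.

```python
import numpy as np

def sample_query_points(points, num_queries, seed=0):
    """Pick query locations directly from the scene (no learned parameters)."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(points), size=num_queries, replace=False)
    return points[idx]

# Toy scene: a dense cluster near x=0 and a sparse cluster near x=5.
rng = np.random.default_rng(1)
dense = rng.normal(loc=[0.0, 0.0, 0.0], scale=0.1, size=(900, 3))
sparse = rng.normal(loc=[5.0, 0.0, 0.0], scale=0.1, size=(100, 3))
scene = np.concatenate([dense, sparse])

queries = sample_query_points(scene, num_queries=64)
near_dense = int((queries[:, 0] < 2.5).sum())
print(queries.shape, near_dense)  # most queries land in the dense cluster
```

Since queries are drawn from actual points, none of them falls in empty space, and regions with more points naturally receive more queries.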
Using the point cloud input, the Transformer encoder produces a representation of the coordinates of an object’s shape and position in the scene. It does this through a series of self-attention operations to capture the global and local contexts necessary for recognition. For instance, it can detect geometric properties of a 3D scene, such as the legs and backrests of chairs positioned around a circular table. As we visualize below in self-attention maps, the encoder automatically captures these important geometric properties.
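For readers unfamiliar with self-attention, the following is a minimal single-head sketch applied to per-point features. It is not the 3DETR encoder: the weights are random stand-ins for learned parameters, and layer normalization, multiple heads, and feed-forward layers are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(feats, w_q, w_k, w_v):
    """Single-head self-attention over (N, D) point features."""
    q, k, v = feats @ w_q, feats @ w_k, feats @ w_v
    scores = q @ k.T / np.sqrt(k.shape[1])   # (N, N) pairwise affinities
    attn = softmax(scores, axis=-1)          # each row sums to 1
    return attn @ v, attn

rng = np.random.default_rng(0)
num_points, dim = 5, 4
feats = rng.normal(size=(num_points, dim))
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) for _ in range(3))

out, attn = self_attention(feats, w_q, w_k, w_v)
print(out.shape, attn.shape)  # (5, 4) (5, 5)
```

Every point attends to every other point, which is how global context enters each point's feature. Reordering the input points simply reorders the output rows, a useful property for unordered point clouds.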
The Transformer decoder takes these point features as input and outputs a set of 3D bounding boxes. It applies a series of cross-attention operations on the point features and query embeddings. The decoder’s self-attention shows that it focuses on the objects in order to predict bounding boxes around them.
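The decoder step can be sketched in the same spirit: each query attends over the encoder's point features, and a small head turns the result into box parameters. Everything here is a simplification under stated assumptions: learned projections are omitted, the box head `w_box` is a hypothetical random matrix, and boxes are parameterized as a center plus a size, which may differ from the released model.

```python
import numpy as np

def cross_attention(queries, point_feats):
    """Each query gathers information from all point features."""
    scores = queries @ point_feats.T / np.sqrt(queries.shape[1])
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ point_feats, weights

rng = np.random.default_rng(0)
num_points, num_queries, dim = 50, 4, 8
point_feats = rng.normal(size=(num_points, dim))        # stand-in encoder output
query_embed = rng.normal(size=(num_queries, dim))       # stand-in query embeddings
query_xyz = rng.uniform(-1, 1, size=(num_queries, 3))   # sampled query locations

attended, weights = cross_attention(query_embed, point_feats)

w_box = rng.normal(scale=0.1, size=(dim, 6))  # hypothetical box head
raw = attended @ w_box
centers = query_xyz + raw[:, :3]  # predict the center relative to the query point
sizes = np.exp(raw[:, 3:])        # exp keeps box sizes positive
boxes = np.concatenate([centers, sizes], axis=1)
print(boxes.shape)  # (4, 6): one box per query
```

Predicting centers as offsets from the sampled query points mirrors the idea of predicting objects relative to points in the scene rather than from a fixed set of learned locations.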
The Transformer encoder is also generic enough that it can be used for other 3D tasks, such as shape classification.
Overall, 3DETR is much simpler to implement than prior work. On 3D benchmarks, 3DETR performs competitively against prior handcrafted 3D architectures. Its design decisions are also compatible with prior 3D work, giving researchers the flexibility to adapt components from 3DETR to their own pipelines.
Self-supervised learning has been a major area of interest in the research community and at FAIR. DepthContrast is our latest attempt to learn powerful 3D representations without using labeled data. This research is related to our previous work on PointContrast, which was also a self-supervised technique for 3D.
The opportunities for obtaining 3D data are now plentiful. Sensors and multiview stereo algorithms often provide complementary information to video or images. However, making sense of this data has previously been a challenge since 3D data has different physical characteristics that depend on how and where it was acquired. For example, depth data from commercial phone sensors looks very different when compared with data from outdoor sensors, such as LiDAR.
Most 3D data used in AI research is acquired in the form of single-view depth maps, which are post-processed by a step called 3D registration to obtain multiview 3D. Prior work has relied on multiview 3D data for learning self-supervised features, with losses designed to take 3D point correspondences into account.
While converting single-view data into multiview data has a high failure rate, as we noted above, DepthContrast shows that using just single-view 3D data is enough to learn state-of-the-art 3D features.
Using 3D data augmentation, we can generate slightly different 3D depth maps from a single-view depth map. DepthContrast then uses contrastive learning to align the features obtained from these augmented depth maps.
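The two ingredients described above, augmentation and contrastive alignment, can be sketched as follows. This is a minimal illustration and not the DepthContrast implementation: the specific augmentations (rotation about the vertical axis, scaling, jitter), the InfoNCE-style loss, the temperature, and the random vectors standing in for encoder outputs are all assumptions.

```python
import numpy as np

def augment(points, rng):
    """Create a slightly different view: rotate about z, rescale, add jitter."""
    theta = rng.uniform(0, 2 * np.pi)
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    scale = rng.uniform(0.9, 1.1)
    jitter = rng.normal(scale=0.01, size=points.shape)
    return (points @ rot.T) * scale + jitter

def info_nce(z_a, z_b, temperature=0.1):
    """Contrastive loss: row i of z_a should match row i of z_b, not other rows."""
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

rng = np.random.default_rng(0)
cloud = rng.uniform(-1, 1, size=(128, 3))
view_a, view_b = augment(cloud, rng), augment(cloud, rng)  # two views of one scene

# Random vectors stand in for the features an encoder would produce per scene.
feats = rng.normal(size=(8, 16))
aligned = info_nce(feats, feats + 0.01 * rng.normal(size=feats.shape))
shuffled = info_nce(feats, feats[::-1])
print(view_a.shape, aligned < shuffled)
```

The loss is low when features of two views of the same scene agree and high when they are paired with features from other scenes, which is the signal that lets a model learn without labels.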
We show that this learning signal can be used to pretrain different types of 3D architectures, such as PointNet++ and Sparse ConvNets.
More importantly, DepthContrast can be applied to any type of 3D data, whether acquired indoors or outdoors, and whether single-view or multiview. Our research shows that models pretrained using DepthContrast set an absolute state of the art on the ScanNet 3D detection benchmark.
DepthContrast’s features provide gains across a variety of 3D benchmarks on tasks such as shape classification, object detection, and segmentation.
DepthContrast shows that self-supervised learning is also promising for 3D understanding. In fact, DepthContrast shares the underlying principle of learning augmentation invariant features, which has been used to power self-supervised models such as Facebook AI’s SEER.
Self-supervised learning continues to be a powerful tool for learning representations across text, images, and video. With depth sensors now found in most smartphones, there are significant opportunities to advance 3D understanding and create new experiences that more people can enjoy.
We hope 3DETR and DepthContrast will help current and new practitioners develop better tools for 3D recognition without the high barriers to entry and arduous engineering previously required. 3DETR is available here and DepthContrast here. We look forward to seeing how these new techniques are applied by the open source community.