Computer Vision

Facebook Research at CVPR 2020

June 12, 2020

Computer vision (CV) researchers and engineers from all over the world will be gathering virtually for the 2020 Conference on Computer Vision and Pattern Recognition (CVPR) from June 14 to June 19, 2020. Facebook AI researchers, as well as researchers in AR/VR, will be presenting research via presentations, hosting tutorials, speaking in workshops, and participating in interactive online Q&As.

At this year’s CVPR conference, Facebook AI is pushing the state of the art forward in many important areas of CV, including core segmentation tasks, architecture search, transfer learning, and multimodal learning. We’re also sharing details on several notable papers that propose new ways to reason about the 3D objects shown in regular 2D images. This work could help us unlock virtual and augmented reality innovations, among other future experiences. The full list of abstracts and details of our CVPR participation appear below.


Novel views from only a single image in complex, real-world scenes

We’ve built SynSin, a state-of-the-art, end-to-end model that can take a single RGB image and then generate a new image of the same scene from a different viewpoint — without any 3D supervision. Our system works by predicting a 3D point cloud, which is projected onto new views using our novel differentiable renderer via PyTorch3D. The rendered point cloud is then passed to a generative adversarial network (GAN) to synthesize the output image. Current methods often use dense voxel grids, which have shown promise on synthetic scenes of single objects, but haven’t been able to scale to complex real-world scenes.

With the flexibility of point clouds, SynSin not only achieves this but also generalizes to varying resolutions with more efficiency than alternatives such as voxel grids. SynSin’s high efficiency could help us explore a wide range of applications, for example, generating better 3D photos and 360-degree videos. Read the full research paper here.
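To make the projection step concrete, here is a minimal sketch of moving a small 3D point cloud into a new camera frame and projecting it with a pinhole model. It is not SynSin's differentiable renderer and does not use the PyTorch3D API; the function names, the focal length, and the toy camera motion are all illustrative assumptions.

```python
import math

def rotate_y(point, angle_rad):
    """Rotate a 3D point about the camera's vertical (y) axis."""
    x, y, z = point
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return (c * x + s * z, y, -s * x + c * z)

def project(point, focal=1.0):
    """Pinhole projection; returns None when the point is behind the camera."""
    x, y, z = point
    if z <= 0:
        return None
    return (focal * x / z, focal * y / z)

def reproject(points, angle_rad, translation, focal=1.0):
    """Rotate, then translate, each point into the new view and project it."""
    projected = []
    for p in points:
        rx, ry, rz = rotate_y(p, angle_rad)
        moved = (rx + translation[0], ry + translation[1], rz + translation[2])
        projected.append(project(moved, focal))
    return projected

# Hypothetical point cloud (x, y, z) in the source camera frame.
cloud = [(0.0, 0.0, 2.0), (1.0, 0.5, 4.0), (0.0, 0.0, -1.0)]
new_view = reproject(cloud, angle_rad=0.0, translation=(0.5, 0.0, 0.0))
```

Points that land behind the new camera are dropped (returned as `None`), which hints at why a later stage is needed to fill in regions the new viewpoint reveals.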

Reconstructing 3D human figures at an unprecedented level of detail and quality from a single image

We’ve developed a novel method for generating 3D reconstructions of people from 2D images with state-of-the-art quality and detail. It captures highly intricate details such as fingers, facial features, and clothing folds using high-resolution photos as input, which was not possible with previous techniques without additional processing.

To achieve this, we built upon the highly memory-efficient Pixel-Aligned Implicit Function (PIFu) method, and created a hierarchical multilevel neural network architecture that processes both global context and local details for high-resolution 3D reconstruction. The first-level network considers the global 3D structure of humans by utilizing lower-resolution input images, similar to the PIFu method. The second network is a lightweight network that can take the higher, 1K-resolution input image to analyze the local details. By enabling access to the global 3D information from the first level, our system can leverage local and global information efficiently for high-resolution 3D human reconstruction. You can see the qualitative results of our method compared with the state of the art below:

Such high-quality, finely detailed 3D reconstructions could help enhance important applications like creating more realistic virtual reality experiences. Read the full research paper here.

‘Wish you were here’: Context-aware human generation

We’ve built a new system that can take an image of a person from one photo and then add it to a different image, while maintaining the quality and semantic context of the scene interaction. It can generate an image of a person in the context of other people in an image, adjusting the source image so that their pose matches the new context. This is a challenging application domain, since it is particularly easy to spot discrepancies between the novel person in the generated image and the existing ones. Unlike previous work on adding people to existing images, our method can be applied to a variety of poses, viewpoints, and scales, and it can handle severe occlusions.

Our method involves three subnetworks. The first generates the structure of the novel person, the second renders a realistic person given the generated structure and an input target, and the third refines the rendered face. We’ve demonstrated high-resolution outputs in various experiments. We’ve also evaluated each network individually, demonstrating state-of-the-art results in pose transfer benchmarks as well as in other possible applications, such as drawing a person or replacing a person’s hair, shirt, or pants.

With the recent increased interest in remote events and interactions across locations, our research could make it easier for people to collaborate more naturally when using video tools or inspire new AR experiences. Read the full paper here.

Facebook AI’s workshops at CVPR 2020

Workshop on Media Forensics June 15, 2020

In addition to the new system above, Facebook AI is studying other areas of face and pose generation, as well as the larger open challenge of manipulated media. We’ll be presenting advancements made in deepfake detection and new synthesis methods such as generative adversarial networks (GANs). Facebook AI speakers at this workshop include Cristian Canton Ferrer and Tal Hassner.

Low-power Computer Vision Challenge June 15, 2020

We’re hosting a workshop around the Low-Power Computer Vision Challenge (LPCV), along with other community members such as Purdue, Duke and Google. The LPCV workshop will focus on bringing community members together to discuss the state of the art of low-power computer vision, challenges in creating efficient vision solutions, promising technologies, methods to acquire and label data, as well as benchmarks and metrics to evaluate progress. Teams competing in the video track of the competition that utilizes PyTorch-Mobile will also gather in this workshop. Facebook AI speakers include Vikas Chandra, Carole-Jean Wu, Joseph Spisak, and Christian Keller.

Women in Computer Vision Workshop June 14, 2020

Facebook AI’s Georgia Gkioxari will be speaking at the half-day workshop for Women in Computer Vision. The goal of this workshop is to provide opportunities for women to share their experiences and work, and to connect with role models in the field. Junior female students will also be able to share their work via poster sessions.

Visual Question Answering and Dialog Workshop June 14, 2020

Facebook AI has helped organize a workshop on visual question answering and visual dialog. Douwe Kiela is a speaker, and co-organizers include Satwik Kottur, Dhruv Batra, and Devi Parikh. The focus of this session is to first benchmark the progress in this domain via visual question answering challenges, and then to discuss state-of-the-art approaches, including MMF: A framework for multimodal AI models. We’ll also talk about best practices and future directions in multimodal AI. As part of this, we are pleased to announce that Facebook AI researchers and engineers Xinlei Chen, Duy Kien Nguyen, Vedanuj Goswami, Licheng Yu, and our former PhD intern Huaizu Jiang (with the University of Massachusetts Amherst) won the VQA challenge. Learn more about their winning technique using convolution grid feature maps.

Visual Recognition for Images, Video, and 3D June 15, 2020

We're hosting a tutorial to discuss popular approaches and recent advancements in the family of visual recognition tasks for different input modalities. We will cover the most recent work on object recognition and scene understanding. This includes building 3D deep learning models with PyTorch3D. Leaders of this tutorial include Ross Girshick, Saining Xie, Alexander Kirillov, Yuxin Wu, Christoph Feichtenhofer, Haoqi Fan, Georgia Gkioxari, Justin Johnson, Nikhila Ravi, Piotr Dollár, and Wan-Yen Lo.

DeepVision June 19, 2020

Facebook AI leaders Yann LeCun and Joaquin Quiñonero Candela will be keynote speakers at the 7th DeepVision Workshop, a full-day event on June 19, and Cristian Canton Ferrer is a co-organizer. The topics will be centered on both theoretical and practical applications of a wide range of cutting-edge learning techniques, including learning with limited data, self-supervised learning, transfer learning, and more.

Computer Vision for AR/VR June 15, 2020

Facebook Reality Labs’ Michael Abrash will be speaking at the Fourth Workshop on Computer Vision for AR/VR. The aim of this workshop is to bring together industry innovators and academic leaders in the AR/VR world to discuss the problems, applications, and the state of AR/VR systems in general. Co-organizers from Facebook include Fernando De la Torre, Matt Uyttendaele, Alexandru Eugen Ichim, and Weipeng Xu.

Embodied AI June 14, 2020

Facebook AI’s Franziska Meier will be speaking on our work in robotics at this year’s Embodied AI workshop. The goal of this workshop is to bring together researchers from the fields of computer vision, language, graphics, and robotics to share and discuss the current state of intelligent agents that can see, talk, act, and reason. The workshop features three challenges focusing on the problems of point navigation, object navigation, and the transfer of models from simulated environments to the real world. Many Facebook AI researchers are co-organizing this event: Julian Straub, Manolis Savva, Devi Parikh, Jitendra Malik, Oleksandr Maksymets, Abhishek Kadian, Aaron Gokaslan, and Dhruv Batra.

DynaVis: The 2nd International Workshop on Dynamic Scene Reconstruction June 14, 2020

Facebook Reality Labs’ Yaser Sheikh is a featured speaker at this year’s workshop on dynamic scene reconstruction, and Michael Zollhoefer is part of the organizing committee. The workshop aims to bring together leading experts in the field of general dynamic scene reconstruction.

Deep Learning for Geometric Computing June 14, 2020

Facebook AI’s Research Director, Jitendra Malik, will be a featured speaker at this year’s workshop on advancements in the state of the art in topological and geometric shape analysis using deep learning. Facebook AI’s Daniel Huber is also on the program committee of the event.

Adversarial Machine Learning in Computer Vision June 19, 2020

Facebook AI’s Laurens van der Maaten is a speaker at this year’s workshop on adversarial machine learning. Facebook AI’s Yuxin Wu and Kaiming He are on the program committee. Discussions will center around enhancing the model robustness against adversarial attacks. Among other similar topics, we’ll focus on strategies to improve current computer vision models.

Computer Vision For Fashion, Art, and Design June 19, 2020

Facebook AI’s Kristen Grauman and Tamara Berg are co-organizers at this year’s third workshop on Computer Vision for Fashion, Art, and Design. Kristen and Devi Parikh will be speakers at this event. We’re helping bring together artists, designers, and computer vision researchers and engineers in an effort to exchange ideas at the intersection of creative applications and computer vision. The workshop features two dataset challenges and a paper submission track.

Full list of Facebook AI research at CVPR 2020

12-in-1: Multitask vision and language representation learning
Jiasen Lu, Vedanuj Goswami, Marcus Rohrbach, Devi Parikh, Stefan Lee

Much of vision-and-language research focuses on a small but diverse set of independent tasks and supporting datasets often studied in isolation; however, the visually grounded language understanding skills required for success at these tasks overlap significantly. In this work, we investigate these relationships between vision-and-language tasks by developing a large-scale, multitask training regime. Our approach culminates in a single model on 12 datasets from four broad categories of tasks including visual question answering, caption-based image retrieval, grounding referring expressions, and multimodal verification. Compared to independently trained single-task models, this represents a reduction from approximately 3 billion parameters to 270 million while simultaneously improving performance by 2.05 points on average across tasks. We use our multitask framework to perform in-depth analysis of the effect of joint training diverse tasks. Further, we show that finetuning task-specific models from our single multitask model can lead to further improvements, achieving performance at or above the state of the art.

A multigrid method for efficiently training video models
Chao-Yuan Wu, Ross Girshick, Kaiming He, Christoph Feichtenhofer, Philipp Krähenbühl

Training competitive deep video models is an order of magnitude slower than training their counterpart image models. Slow training causes long research cycles, which hinders progress in video understanding research. Following standard practice for training image models, video model training has used a fixed mini-batch shape: a specific number of clips, frames, and spatial size. However, what is the optimal shape? High-resolution models perform well, but train slowly. Low-resolution models train faster but are less accurate. Inspired by multigrid methods in numerical optimization, we propose to use variable mini-batch shapes with different spatial-temporal resolutions that are varied according to a schedule. The different shapes arise from resampling the training data on multiple sampling grids. Training is accelerated by scaling up the mini-batch size and learning rate when shrinking the other dimensions. We empirically demonstrate a general and robust grid schedule that yields a significant out-of-the-box training speedup without a loss in accuracy for different models (I3D, nonlocal, SlowFast), datasets (Kinetics, Something-Something, Charades), and training settings (with and without pretraining, 128 GPUs or 1 GPU). As an illustrative example, the proposed multigrid method trains a ResNet-50 SlowFast network 4.5x faster (wall-clock time, same hardware) while also improving accuracy (+0.8 percent absolute) on Kinetics-400 compared to baseline training. Code is available online.
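The trade-off between clip shape and mini-batch size can be sketched with a few lines of arithmetic. This is a rough illustration, not the paper's actual schedule: the baseline shape, the coarse-to-fine shapes, and the linear learning-rate scaling rule are all assumptions chosen for the example.

```python
def scaled_batch(base_batch, base_shape, new_shape):
    """Pick a batch size that keeps batch x frames x height x width roughly constant."""
    base_volume = base_shape[0] * base_shape[1] * base_shape[2]
    new_volume = new_shape[0] * new_shape[1] * new_shape[2]
    return max(1, base_batch * base_volume // new_volume)

def scaled_lr(base_lr, base_batch, new_batch):
    """Scale the learning rate in proportion to batch size (an assumption here)."""
    return base_lr * new_batch / base_batch

BASE_BATCH, BASE_LR = 8, 0.1
BASE_SHAPE = (16, 224, 224)  # frames, height, width (hypothetical baseline)

# Hypothetical coarse-to-fine sequence of clip shapes.
schedule = [(4, 112, 112), (8, 160, 160), (16, 224, 224)]

plan = []
for shape in schedule:
    batch = scaled_batch(BASE_BATCH, BASE_SHAPE, shape)
    plan.append((shape, batch, scaled_lr(BASE_LR, BASE_BATCH, batch)))
```

With these numbers, the coarsest shape fits 16 times as many clips per mini-batch as the baseline, which is where the speedup in such a scheme would come from.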

ARCH: Animatable Reconstruction of Clothed Humans
Zeng Huang, Yuanlu Xu, Christoph Lassner, Hao Li, Tony Tung

In this paper, we propose ARCH (Animatable Reconstruction of Clothed Humans), a novel end-to-end framework for accurate reconstruction of animation-ready 3D clothed humans from a monocular image. Existing approaches to digitize 3D humans struggle to handle pose variations and recover details. Also, they do not produce models that are animation ready. In contrast, ARCH is a learned pose-aware model that produces detailed 3D rigged full-body human avatars from a single unconstrained RGB image. A Semantic Space and a Semantic Deformation Field are created using a parametric 3D body estimator. They allow the transformation of 2D/3D clothed humans into a canonical space, reducing ambiguities in geometry caused by pose variations and occlusions in training data. Detailed surface geometry and appearance are learned using an implicit function representation with spatial local features. Furthermore, we propose additional per-pixel supervision on the 3D reconstruction using opacity-aware differentiable rendering. Our experiments indicate that ARCH increases the fidelity of the reconstructed humans. We obtain more than 50 percent lower reconstruction errors for standard metrics compared to state-of-the-art methods on public datasets. We also show numerous qualitative examples of animated, high-quality reconstructed avatars unseen in the literature so far.

Articulation-aware Canonical Surface Mapping
Nilesh Kulkarni, Abhinav Gupta, David F. Fouhey, Shubham Tulsiani

We tackle the tasks of: 1) predicting a Canonical Surface Mapping (CSM) that indicates the mapping from 2D pixels to corresponding points on a canonical template shape, and 2) inferring the articulation and pose of the template corresponding to the input image. While previous approaches rely on leveraging keypoint supervision for learning, we present an approach that can learn without such annotations. Our key insight is that these tasks are geometrically related, and we can obtain supervisory signal via enforcing consistency among the predictions. We present results across a diverse set of animate object categories, showing that our method can learn articulation and CSM prediction from image collections using only foreground mask labels for training. We empirically show that allowing articulation helps learn more accurate CSM prediction, and that enforcing the consistency with predicted CSM is similarly critical for learning meaningful articulation.

Classifying, segmenting, and tracking object instances in video with mask propagation
Gedas Bertasius, Lorenzo Torresani

We introduce a method for simultaneously classifying, segmenting, and tracking object instances in a video sequence. Our method, named MaskProp, adapts the popular Mask R-CNN to video by adding a mask propagation branch that propagates frame-level object instance masks from each video frame to all the other frames in a video clip. This allows our system to predict clip-level instance tracks with respect to the object instances segmented in the middle frame of the clip. Clip-level instance tracks generated densely for each frame in the sequence are finally aggregated to produce video-level object instance segmentation and classification. Our experiments demonstrate that our clip-level instance segmentation makes our approach robust to motion blur and object occlusions in video. MaskProp achieves the best reported accuracy on the YouTube-VIS dataset, outperforming the ICCV 2019 video instance segmentation challenge winner despite being much simpler and using orders of magnitude less labeled data (1.3M vs. 1B images and 860K vs. 14M bounding boxes). The project page is at:

Cluster and relearn: Improving generalization of visual representations
Xueting Yan, Ishan Misra, Abhinav Gupta, Deepti Ghadiyaram, Dhruv Mahajan

Pretraining convolutional neural networks with weakly supervised and self-supervised strategies is becoming increasingly popular for several computer vision tasks. However, due to the lack of strong discriminative signals, these learned representations may overfit to the pretraining objective (e.g., hashtag prediction) and not generalize well to downstream tasks. In this work, we present a simple strategy — ClusterFit (CF) to improve the robustness of the visual representations learned during pretraining. Given a dataset, we (a) cluster its features extracted from a pretrained network using k-means and (b) retrain a new network from scratch on this dataset using cluster assignments as pseudo-labels. We empirically show that clustering helps reduce the pretraining task-specific information from the extracted features, thereby minimizing overfitting to the same. Our approach is extensible to different pretraining frameworks — weak- and self-supervised, modalities — images and videos, and pretraining tasks — object and action classification. Through extensive transfer learning experiments on 11 different target datasets of varied vocabularies and granularities, we show that CF significantly improves the representation quality compared with the state-of-the-art large-scale (millions / billions) weakly supervised image and video models and self-supervised image models.
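Step (a) above, clustering extracted features and treating the cluster indices as pseudo-labels, can be sketched in plain Python. The toy 2D "features," the deterministic initialization, and the function names are assumptions for illustration; step (b), retraining a network from scratch on the pseudo-labels, is not shown.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(features, k, iterations=10):
    """Plain Lloyd's k-means; returns one cluster index per feature vector."""
    # Deterministic initialization for the sketch: evenly spaced samples.
    step = len(features) // k
    centroids = [features[i * step] for i in range(k)]
    assignments = [0] * len(features)
    for _ in range(iterations):
        # Assign every feature to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: squared_distance(f, centroids[c]))
            for f in features
        ]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [f for f, a in zip(features, assignments) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignments

# Stand-in for features extracted from a pretrained network (hypothetical values).
features = [[0.1, 0.2], [0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
pseudo_labels = kmeans(features, k=2)
```

The resulting `pseudo_labels` list plays the role of the training targets for the new network; the original pretraining labels are no longer used at that point.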

Designing network design spaces
Ilija Radosavovic, Raj Prateek Kosaraju, Ross Girshick, Kaiming He, Piotr Dollár

In this work, we present a new network design paradigm. Our goal is to help advance the understanding of network design and discover design principles that generalize across settings. Instead of focusing on designing individual network instances, we design network design spaces that parametrize populations of networks. The overall process is analogous to classic manual design of networks, but elevated to the design space level. Using our methodology we explore the structure aspect of network design and arrive at a low-dimensional design space consisting of simple, regular networks that we call RegNet. The core insight of the RegNet parametrization is surprisingly simple: Widths and depths of good networks can be explained by a quantized linear function. We analyze the RegNet design space and arrive at interesting findings that do not match the current practice of network design. The RegNet design space provides simple and fast networks that work well across a wide range of flop regimes. Under comparable training settings and flops, the RegNet models outperform the popular EfficientNet models while being up to 5x faster on GPUs.
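One way to read "quantized linear function" is sketched below: per-block widths start as a straight line in the block index and are then snapped to a small set of values spaced by a constant multiplier, so that consecutive blocks with the same width form a stage. The parameter names (`w0`, `wa`, `wm`) and the values used are assumptions for illustration, not settings of any released RegNet model.

```python
import math

def quantized_linear_widths(depth, w0, wa, wm):
    """Linear widths u_j = w0 + wa * j, snapped to w0 * wm**k for integer k."""
    widths = []
    for j in range(depth):
        u = w0 + wa * j
        k = round(math.log(u / w0) / math.log(wm))
        widths.append(int(round(w0 * wm ** k)))
    return widths

def group_into_stages(widths):
    """Collapse runs of equal width into (width, number_of_blocks) stages."""
    stages = []
    for w in widths:
        if stages and stages[-1][0] == w:
            stages[-1] = (w, stages[-1][1] + 1)
        else:
            stages.append((w, 1))
    return stages

# Hypothetical parameters chosen so the arithmetic is easy to follow.
widths = quantized_linear_widths(depth=8, w0=32, wa=24, wm=2.0)
stages = group_into_stages(widths)
```

Here eight blocks collapse into four stages, so a whole network is described by a handful of numbers rather than a per-layer specification.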

Don’t judge an object by its context: Learning to overcome contextual bias
Krishna Kumar Singh, Dhruv Mahajan, Kristen Grauman, Yong Jae Lee, Matt Feiszli, Deepti Ghadiyaram

Existing models often leverage co-occurrences between objects and their context to improve recognition accuracy. However, strongly relying on context risks a model’s generalizability, especially when typical co-occurrence patterns are absent. This work focuses on addressing such contextual biases to improve the robustness of the learnt feature representations. Our goal is to accurately recognize a category in the absence of its context, without compromising on performance when it co-occurs with context. Our key idea is to decorrelate feature representations of a category from its co-occurring context. We achieve this by learning a feature subspace that explicitly represents categories occurring in the absence of context alongside a joint feature subspace that represents both categories and context. Our very simple yet effective method is extensible to two multilabel tasks – object and attribute classification. On four challenging datasets, we demonstrate the effectiveness of our method in reducing contextual bias.

EGO-TOPO: Environment affordances from egocentric video
Tushar Nagarajan, Yanghao Li, Christoph Feichtenhofer, Kristen Grauman

First-person video naturally brings the use of a physical environment to the forefront, since it shows the camera wearer interacting fluidly in a space based on his intentions. However, current methods largely separate the observed actions from the persistent space itself. We introduce a model for environment affordances that is learned directly from egocentric video. The main idea is to gain a human-centric model of a physical space (such as a kitchen) that captures (1) the primary spatial zones of interaction and (2) the likely activities they support. Our approach decomposes a space into a topological map derived from first-person activity, organizing an ego-video into a series of visits to the different zones. Further, we show how to link zones across multiple related environments (e.g., from videos of multiple kitchens) to obtain a consolidated representation of environment functionality. On EPIC-Kitchens and EGTEA+, we demonstrate our approach for learning scene affordances and anticipating future actions in long-form video. Project page:

End-to-end view synthesis from a single image
Olivia Wiles, Georgia Gkioxari, Richard Szeliski, Justin Johnson

View synthesis allows for the generation of new views of a scene given one or more images. This is challenging; it requires comprehensively understanding the 3D scene from images. As a result, current methods typically use multiple images, train on ground-truth depth, or are limited to synthetic data. We propose a novel end-to-end model for this task using a single image at test time; it is trained on real images without any ground-truth 3D information. To this end, we introduce a novel differentiable point cloud renderer that is used to transform a latent 3D point cloud of features into the target view. The projected features are decoded by our refinement network to inpaint missing regions and generate a realistic output image. The 3D component inside of our generative model allows for interpretable manipulation of the latent feature space at test time, e.g., we can animate trajectories from a single image. Additionally, we can generate high-resolution images and generalize to other input resolutions. We outperform baselines and prior work on the Matterport, Replica, and RealEstate10K datasets.

Epipolar transformers
Yihui He, Rui Yan, Katerina Fragkiadaki, Shoou-I Yu

A common approach to localize 3D human joints in a synchronized and calibrated multiview setup consists of two steps: (1) Apply a 2D detector separately on each view to localize joints in 2D, and (2) perform robust triangulation on 2D detections from each view to acquire the 3D joint locations. However, in step 1, the 2D detector is limited to solving challenging cases that could potentially be better resolved in 3D, such as occlusions and oblique viewing angles, purely in 2D without leveraging any 3D information. Therefore, we propose the differentiable “epipolar transformer,” which enables the 2D detector to leverage 3D-aware features to improve 2D pose estimation. The intuition is: given a 2D location p in the current view, we would like to first find its corresponding point p′ in a neighboring view, and then combine the features at p′ with the features at p, thus leading to a 3D-aware feature at p. Inspired by stereo matching, the epipolar transformer leverages epipolar constraints and feature matching to approximate the features at p′. Experiments on InterHand and Human3.6M [13] show that our approach has consistent improvements over the baselines. Specifically, in the condition where no external data is used, our Human3.6M model trained with ResNet-50 backbone and image size 256×256 outperforms state of the art by 4.23 mm and achieves MPJPE 26.9 mm. Code is available.
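The intuition of approximating the features at the corresponding point by feature matching can be sketched as follows. The sketch assumes that candidate features have already been sampled along the epipolar line in the neighboring view, uses a softmax over dot-product similarity as the matching score, and fuses by simple addition; these choices, and every name below, are illustrative assumptions rather than the paper's exact design.

```python
import math

def dot(a, b):
    """Dot product of two equal-length feature vectors."""
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse(feature_p, epipolar_features):
    """Weight candidates by similarity to feature_p, then add the weighted sum to it."""
    weights = softmax([dot(feature_p, f) for f in epipolar_features])
    matched = [
        sum(w * f[i] for w, f in zip(weights, epipolar_features))
        for i in range(len(feature_p))
    ]
    fused = [a + b for a, b in zip(feature_p, matched)]
    return fused, weights

# Feature at p in the current view, and hypothetical samples along the epipolar line.
feature_p = [1.0, 0.0]
candidates = [[10.0, 0.0], [0.0, 10.0], [-10.0, 0.0]]
fused, weights = fuse(feature_p, candidates)
```

Because every step is a smooth function of the features, a block like this can sit inside a detector and be trained with it, which is the sense in which the operation is differentiable.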

FBNetV2: Differentiable neural architecture search for spatial and channel dimensions
Alvin Wan, Xiaoliang Dai, Peizhao Zhang, Zijian He, Yuandong Tian, Saining Xie, Bichen Wu, Matthew Yu, Tao Xu, Kan Chen, Peter Vajda, Joseph E. Gonzalez

Differentiable Neural Architecture Search (DNAS) has demonstrated great success in designing state-of-the-art, efficient neural networks. However, DARTS-based DNAS’s search space is small when compared with other search methods’, since all candidate network layers must be explicitly instantiated in memory. To address this bottleneck, we propose a memory and computationally efficient DNAS variant: DMaskingNAS. This algorithm expands the search space by up to 10^14x over conventional DNAS, supporting searches over spatial and channel dimensions that are otherwise prohibitively expensive: input resolution and number of filters. We propose a masking mechanism for feature map reuse, so that memory and computational costs stay nearly constant as the search space expands. Furthermore, we employ effective shape propagation to maximize per-FLOP or per-parameter accuracy. The searched FBNetV2s yield state-of-the-art performance when compared with all previous architectures. With up to 421x less search cost, DMaskingNAS finds models with 0.9 percent higher accuracy, 15 percent fewer FLOPs than MobileNetV3-Small; and with similar accuracy but 20 percent fewer FLOPs than Efficient-B0. Furthermore, our FBNetV2 outperforms MobileNetV3 by 2.6 percent in accuracy, with equivalent model size. FBNetV2 models are open-sourced at
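A masking mechanism for channel search can be pictured like this: compute the layer once at its maximum channel count, then multiply the output by a weighted combination of binary masks, one per candidate channel count. The sketch below is a simplified reading under stated assumptions: a plain softmax stands in for whatever sampling scheme the search uses, each channel is reduced to a single number, and all names are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over architecture parameters."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def channel_mask(num_channels, max_channels):
    """Ones for the first num_channels entries, zeros for the rest."""
    return [1.0 if i < num_channels else 0.0 for i in range(max_channels)]

def masked_output(features, channel_options, arch_logits):
    """Apply the weighted sum of candidate masks to one shared feature map."""
    weights = softmax(arch_logits)
    max_channels = len(features)
    combined = [0.0] * max_channels
    for weight, option in zip(weights, channel_options):
        mask = channel_mask(option, max_channels)
        combined = [c + weight * m for c, m in zip(combined, mask)]
    return [f * m for f, m in zip(features, combined)]

# One value per channel, produced by a single layer at the maximum width.
features = [2.0, 2.0, 2.0, 2.0]
output = masked_output(features, channel_options=[2, 4], arch_logits=[0.0, 0.0])
```

Only one feature map is stored regardless of how many channel counts are being considered, which illustrates why the cost can stay nearly flat as candidates are added.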

From Paris to Berlin: Discovering style influences around the world
Ziad Al-Halah, Kristen Grauman

The evolution of clothing styles and their migration across the world is intriguing yet difficult to describe quantitatively. We propose to discover and quantify fashion influences from everyday images of people wearing clothes. We introduce an approach that detects which cities influence which other cities in terms of propagating their styles. We then leverage the discovered influence patterns to inform a forecasting model that predicts the popularity of any given style at any given city into the future. Demonstrating our idea with GeoStyle — a large-scale dataset of 7.7M images covering 44 major world cities — we present the discovered influence relationships, revealing how cities exert and receive fashion influence for an array of 50 observed visual styles. Furthermore, the proposed forecasting model achieves state-of-the-art results for a challenging style forecasting task, showing the advantage of grounding visual style evolution both spatially and temporally.

From Patches to Pictures (PaQ-2-PiQ): Mapping the Perceptual Space of Picture Quality
Zhenqiang Ying, Haoran Niu, Praful Gupta, Dhruv Mahajan, Deepti Ghadiyaram, Alan Bovik

Blind or no-reference (NR) perceptual picture quality prediction is a difficult, unsolved problem of great consequence to the social and streaming media industries that impacts billions of viewers daily. Unfortunately, popular NR prediction models perform poorly on real-world distorted pictures. To advance progress on this problem, we introduce the largest (by far) subjective picture quality database, containing about 40,000 real-world distorted pictures and 120,000 patches, on which we collected about 4M human judgments of picture quality. Using these picture and patch quality labels, we built deep region-based architectures that learn to produce state-of-the-art global picture quality predictions as well as useful local picture quality maps. Our innovations include picture quality prediction architectures that produce global-to-local inferences as well as local-to-global inferences (via feedback). The dataset and source code are available at

GrappaNet: Combining Parallel Imaging With Deep Learning for Multi-Coil MRI Reconstruction
Anuroop Sriram, Jure Zbontar, Tullie Murrell, C. Lawrence Zitnick, Aaron Defazio, Daniel K. Sodickson

Magnetic Resonance Image (MRI) acquisition is an inherently slow process which has spurred the development of two different acceleration methods: acquiring multiple correlated samples simultaneously (parallel imaging) and acquiring fewer samples than necessary for traditional signal processing methods (compressed sensing). Both methods provide complementary approaches to accelerating the speed of MRI acquisition. In this paper, we present a novel method to integrate traditional parallel imaging methods into deep neural networks that is able to generate high quality reconstructions even for high acceleration factors. The proposed method, called GrappaNet, performs progressive reconstruction by first using a neural network to map the reconstruction problem to a simpler one that can be solved by a traditional parallel imaging method, followed by an application of a parallel imaging method, and finally fine-tuning the output with another neural network. The entire network can be trained end-to-end. We present experimental results on the recently released fastMRI dataset and show that GrappaNet can generate higher quality reconstructions than competing methods for both 4× and 8× acceleration.

Hierarchical scene coordinate classification and regression for visual localization
Xiaotian Li, Shuzhe Wang, Yi Zhao, Jakob Verbeek, Juho Kannala

Visual localization is critical to many applications in computer vision and robotics. To address single-image RGB localization, state-of-the-art feature-based methods match local descriptors between a query image and a prebuilt 3D model. Recently, deep neural networks have been exploited to regress the mapping between raw pixels and 3D coordinates in the scene, and thus the matching is implicitly performed by the forward pass through the network. However, in a large and ambiguous environment, learning such a regression task directly can be difficult for a single network. In this work, we present a new hierarchical scene coordinate network to predict pixel scene coordinates in a coarse-to-fine manner from a single RGB image. The network consists of a series of output layers, each of them conditioned on the previous ones. The final output layer predicts the 3D coordinates and the others produce progressively finer discrete location labels. The proposed method outperforms the baseline regression-only network and allows us to train compact models which scale robustly to large environments. It sets a new state of the art for single-image RGB localization performance on the 7-Scenes, 12-Scenes, Cambridge Landmarks datasets, and three combined scenes. Moreover, for large-scale outdoor localization on the Aachen Day-Night dataset, we present a hybrid approach which outperforms existing scene coordinate regression methods, and significantly reduces the performance gap with respect to explicit feature matching methods.

Improving lowshot object detection with weakly labeled data
Vignesh Ramanathan, Rui Wang, Dhruv Mahajan

Large detection datasets have a long tail of lowshot classes with very few bounding box annotations. We wish to improve detection for lowshot classes with weakly labeled web-scale datasets only having image-level labels. This requires a detection framework that can be jointly trained with a limited number of bounding box annotated images and a large number of weakly labeled images. Toward this end, we propose a modification to the FRCNN model to automatically infer label assignment for object proposals from weakly labeled images during training. We pose this label assignment as a linear program with constraints on the number and overlap of object instances in an image. We show that this can be solved efficiently during training for weakly labeled images. Compared with just training with few annotated examples, augmenting with weakly labeled examples in our framework provides significant gains. We demonstrate this on the LVIS dataset (3.5 percent gain in AP) as well as different lowshot variants of the COCO dataset. We provide a thorough analysis of the effect of the amount of weakly labeled and fully labeled data required to train the detection model. Our DLWL framework can also outperform self-supervised baselines like omni-supervision for lowshot classes.

ImVoteNet: Boosting 3D object detection in point clouds with image votes
Charles R. Qi, Xinlei Chen, Or Litany, Leonidas J. Guibas

3D object detection has seen quick progress thanks to advances in deep learning on point clouds. A few recent works have even shown state-of-the-art performance with just point cloud input (e.g., VOTENET). However, point cloud data have inherent limitations. They are sparse, lack color information and often suffer from sensor noise. Images, on the other hand, have high resolution and rich texture. Thus they can complement the 3D geometry provided by point clouds. Yet how to effectively use image information to assist point cloud-based detection is still an open question. In this work, we build on top of VOTENET and propose a 3D detection architecture called IMVOTENET specialized for RGB-D scenes. IMVOTENET is based on fusing 2D votes in images and 3D votes in point clouds. Compared to prior work on multimodal detection, we explicitly extract both geometric and semantic features from the 2D images. We leverage camera parameters to lift these features to 3D. To improve the synergy of 2D-3D feature fusion, we also propose a multitower training scheme. We validate our model on the challenging SUN RGB-D dataset, advancing state-of-the-art results by 5.7 mAP. We also provide rich ablation studies to analyze the contribution of each design choice.

In defense of grid features for visual question answering
Huaizu Jiang, Ishan Misra, Marcus Rohrbach, Erik Learned-Miller, Xinlei Chen

Popularized as “bottom-up” attention, bounding box (or region) based visual features have recently surpassed vanilla grid-based convolutional features as the de facto standard for vision and language tasks like visual question answering (VQA). However, it is not clear whether the advantages of regions (e.g., better localization) are the key reasons for the success of bottom-up attention. In this paper, we revisit grid features for VQA, and find they can work surprisingly well — running more than an order of magnitude faster with the same accuracy (e.g., if pretrained in a similar fashion). Through extensive experiments, we verify that this observation holds true across different VQA models and datasets, and generalizes well to other tasks like image captioning. As grid features make the model design and training process much simpler, this enables us to train them end-to-end and also use a more flexible network design. We learn VQA models end to end, from pixels directly to answers, and show that strong performance is achievable without using any region annotations in pretraining. We hope our findings help further improve the scientific understanding and the practical application of VQA. Code and features will be made available.

Iterative answer prediction with pointer-augmented multimodal transformers for TextVQA
Ronghang Hu, Amanpreet Singh, Trevor Darrell, Marcus Rohrbach

Many visual scenes contain text that carries crucial information, and it is thus essential to understand text in images for downstream reasoning tasks. For example, a deep water label on a warning sign warns people about the danger in the scene. Recent work has explored the TextVQA task that requires reading and understanding text in images to answer a question. However, existing approaches for TextVQA are mostly based on custom pairwise fusion mechanisms between a pair of two modalities and are restricted to a single prediction step by casting TextVQA as a classification task. In this work, we propose a novel model for the TextVQA task based on a multimodal transformer architecture accompanied by a rich representation for text in images. Our model naturally fuses different modalities homogeneously by embedding them into a common semantic space where self-attention is applied to model inter- and intramodality context. Furthermore, it enables iterative answer decoding with a dynamic pointer network, allowing the model to form an answer through multistep prediction instead of one-step classification. Our model outperforms existing approaches on three benchmark datasets for the TextVQA task by a large margin.

Lightweight multiview 3D pose estimation through camera-disentangled representation
Edoardo Remelli, Shangchen Han, Sina Honari, Pascal Fua, Robert Wang

We present a lightweight solution to recover 3D pose from multiview images captured with spatially calibrated cameras. Building upon recent advances in interpretable representation learning, we exploit 3D geometry to fuse input images into a unified latent representation of pose, which is disentangled from camera view-points. This allows us to reason effectively about 3D pose across different views without using compute-intensive volumetric grids. Our architecture then conditions the learned representation on camera projection operators to produce accurate per-view 2D detections that can be simply lifted to 3D via a differentiable Direct Linear Transform (DLT) layer. In order to do it efficiently, we propose a novel implementation of DLT that is orders of magnitude faster on GPU architectures than standard SVD-based triangulation methods. We evaluate our approach on two large-scale human pose datasets (H36M and Total Capture): Our method outperforms or performs comparably to the state-of-the-art volumetric methods, while, unlike them, yielding real-time performance.

Listen to look: Action recognition by previewing audio
Ruohan Gao, Tae-Hyun Oh, Kristen Grauman, Lorenzo Torresani

In the face of the video data deluge, today’s expensive clip-level classifiers are increasingly impractical. We propose a framework for efficient action recognition in untrimmed video that uses audio as a preview mechanism to eliminate both short-term and long-term visual redundancies. First, we devise an IMGAUD2VID framework that hallucinates clip-level features by distilling from lighter modalities — a single frame and its accompanying audio — reducing short-term temporal redundancy for efficient clip-level recognition. Second, building on IMGAUD2VID, we further propose IMGAUD-SKIMMING, an attention-based long short-term memory network that iteratively selects useful moments in untrimmed videos, reducing long-term temporal redundancy for efficient video-level recognition. Extensive experiments on four action recognition datasets demonstrate that our method achieves the state of the art in terms of both recognition accuracy and speed.

Momentum contrast for unsupervised visual representation learning
Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, Ross Girshick

We present Momentum Contrast (MoCo) for unsupervised visual representation learning. From a perspective on contrastive learning [27] as dictionary look-up, we build a dynamic dictionary with a queue and a moving-averaged encoder. This enables building a large and consistent dictionary on-the-fly that facilitates contrastive unsupervised learning. MoCo provides competitive results under the common linear protocol on ImageNet classification. More importantly, the representations learned by MoCo transfer well to downstream tasks. MoCo can outperform its supervised pre-training counterpart in 7 detection/segmentation tasks on PASCAL VOC, COCO, and other datasets, sometimes surpassing it by large margins. This suggests that the gap between unsupervised and supervised representation learning has been largely closed in many vision tasks. Code is available at
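
The queue and moving-averaged encoder described above can be illustrated with a small sketch. It is not the released MoCo code: plain Python lists stand in for encoder weights, strings stand in for encoded keys, and the names `momentum_update` and `KeyQueue` and the default coefficient `m` are assumptions made for illustration.

```python
from collections import deque


def momentum_update(key_params, query_params, m=0.999):
    """Move each key-encoder parameter slowly toward the query encoder.

    m is an assumed momentum coefficient; values close to 1 keep the
    key encoder (and therefore the dictionary) changing slowly.
    """
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]


class KeyQueue:
    """Fixed-size dictionary of encoded keys; the oldest keys drop out first."""

    def __init__(self, max_size):
        self.keys = deque(maxlen=max_size)

    def enqueue(self, batch_keys):
        # Appending past max_size silently evicts the oldest entries.
        self.keys.extend(batch_keys)


# Toy usage: scalar "parameters" and string "keys" stand in for real tensors.
key_params = momentum_update([1.0, 2.0], [0.0, 0.0], m=0.5)
queue = KeyQueue(max_size=3)
queue.enqueue(["k1", "k2"])
queue.enqueue(["k3", "k4"])  # "k1" is evicted to keep the size at 3
```

Because the queue size is decoupled from the batch size, the dictionary can be much larger than a single mini-batch, while the slow-moving key encoder keeps older and newer keys comparable.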

FroDO: From detections to 3D objects
Martin Rünz, Kejie Li, Meng Tang, Lingni Ma, Chen Kong, Tanner Schmidt, Ian Reid, Lourdes Agapito, Julian Straub, Steven Lovegrove, Richard Newcombe

Object-oriented maps are important for scene understanding since they jointly capture geometry and semantics, and allow individual instantiation and meaningful reasoning about objects. We introduce FroDO, a method for accurate 3D reconstruction of object instances from RGB video that infers object location, pose, and shape in a coarse-to-fine manner. Key to FroDO is to embed object shapes in a novel learnt space that allows seamless switching between sparse point cloud and dense DeepSDF decoding. Given an input sequence of localized RGB frames, FroDO first aggregates 2D detections to instantiate a category-aware 3D bounding box per object. A shape code is regressed using an encoder network before optimizing shape and pose further under the learnt shape priors using sparse and dense shape representations. The optimization uses multiview geometric, photometric, and silhouette losses. We evaluate on real-world datasets, including Pix3D, Redwood-OS, and ScanNet, for single-view, multiview, and multiobject reconstruction.

One-Shot Domain Adaptation For Face Generation
Chao Yang, Ser-Nam Lim

In this paper, we propose a framework capable of generating face images that fall into the same distribution as that of a given one-shot example. We leverage a pre-trained StyleGAN model that already learned the generic face distribution. Given the one-shot target, we develop an iterative optimization scheme that rapidly adapts the weights of the model to shift the output’s high-level distribution to the target’s. To generate images of the same distribution, we introduce a style-mixing technique that transfers the low-level statistics from the target to faces randomly generated with the model. With that, we are able to generate an unlimited number of faces that inherit from the distribution of both generic human faces and the one-shot example. The newly generated faces can serve as augmented training data for other downstream tasks. Such a setting is appealing as it requires labeling very few examples, or even one, in the target domain, which is often the case for real-world face manipulations that result from a variety of unknown and unique distributions, each with extremely low prevalence. We show the effectiveness of our one-shot approach for detecting face manipulations and compare it with other few-shot domain adaptation methods qualitatively and quantitatively.

PIFuHD: Multilevel pixel-aligned implicit function for high-resolution 3D human digitization
Shunsuke Saito, Tomas Simon, Jason Saragih, Hanbyul Joo

Recent advances in image-based 3D human shape estimation have been driven by the significant improvement in representation power afforded by deep neural networks. Although current approaches have demonstrated the potential in real-world settings, they still fail to produce reconstructions with the level of detail often present in the input images. We argue that this limitation stems primarily from two conflicting requirements: accurate predictions require large context, but precise predictions require high resolution. Due to memory limitations in current hardware, previous approaches tend to take low-resolution images as input to cover large spatial context, and produce less precise (or low-resolution) 3D estimates as a result. We address this limitation by formulating a multilevel architecture that is end-to-end trainable. A coarse level observes the whole image at lower resolution and focuses on holistic reasoning. This provides context to a fine level which estimates highly detailed geometry by observing higher-resolution images. We demonstrate that our approach significantly outperforms existing state-of-the-art techniques on single-image human shape reconstruction by fully leveraging 1K-resolution input images.

PointRend: Image segmentation as rendering
Alexander Kirillov, Yuxin Wu, Kaiming He, Ross Girshick

We present a new method for efficient high-quality image segmentation of objects and scenes. By analogizing classical computer graphics methods for efficient rendering with over- and undersampling challenges faced in pixel labeling tasks, we develop a unique perspective of image segmentation as a rendering problem. From this vantage, we present the PointRend (Point-based Rendering) neural network module: a module that performs point-based segmentation predictions at adaptively selected locations based on an iterative subdivision algorithm. PointRend can be flexibly applied to both instance and semantic segmentation tasks by building on top of existing state-of-the-art models. While many concrete implementations of the general idea are possible, we show that a simple design already achieves excellent results. Qualitatively, PointRend outputs crisp object boundaries in regions that are oversmoothed by previous methods. Quantitatively, PointRend yields significant gains on COCO and Cityscapes, for both instance and semantic segmentation. PointRend’s efficiency enables output resolutions that are otherwise impractical in terms of memory or computation compared to existing approaches. Code has been made available at
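
One step of the adaptive subdivision idea can be sketched as follows. This is a simplified illustration rather than the PointRend implementation: it uses nearest-neighbour upsampling on nested lists, treats "closest to 0.5" as the uncertainty measure, and takes a hypothetical `point_head` callback in place of the learned point-wise predictor. All function names are assumptions.

```python
def upsample2x(mask):
    """Nearest-neighbour 2x upsampling of a 2D grid of foreground probabilities."""
    out = []
    for row in mask:
        wide = [p for p in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def most_uncertain_points(mask, n):
    """Return the n (row, col) cells whose probability is closest to 0.5."""
    cells = [(abs(p - 0.5), r, c)
             for r, row in enumerate(mask)
             for c, p in enumerate(row)]
    cells.sort()
    return [(r, c) for _, r, c in cells[:n]]


def subdivide_step(mask, point_head, n_points):
    """Upsample the mask, then re-predict only the most uncertain cells."""
    fine = upsample2x(mask)
    for r, c in most_uncertain_points(fine, n_points):
        fine[r][c] = point_head(r, c)  # hypothetical point-wise predictor
    return fine


# Toy usage: the 0.5 cell is ambiguous, so only its upsampled copies are refined.
coarse = [[0.9, 0.5],
          [0.1, 0.2]]
refined = subdivide_step(coarse, point_head=lambda r, c: 1.0, n_points=4)
```

Repeating this step doubles the resolution each time while spending the expensive per-point prediction only where the coarse mask is ambiguous, typically near object boundaries.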

Pretext invariant self-supervised representation learning
Ishan Misra, Laurens van der Maaten

The goal of self-supervised learning from images is to construct image representations that are semantically meaningful via pretext tasks that do not require semantic annotations. Many pretext tasks lead to representations that are covariant with image transformations. We argue that, instead, semantic representations ought to be invariant under such transformations. Specifically, we develop Pretext-Invariant Representation Learning (PIRL, pronounced “pearl”) that learns invariant representations based on pretext tasks. We use PIRL with a commonly used pretext task that involves solving jigsaw puzzles. We find that PIRL substantially improves the semantic quality of the learned image representations. Our approach sets a new state of the art in self-supervised learning from images on several popular benchmarks for self-supervised learning. Despite being unsupervised, PIRL outperforms supervised pretraining in learning image representations for object detection. Altogether, our results demonstrate the potential of self-supervised representations with good invariance properties.

Transferring Dense Pose to Proximal Animal Classes
Artsiom Sanakoyeu, Vasil Khalidov, Maureen S. McCarthy, Andrea Vedaldi, Natalia Neverova

Recent contributions have demonstrated that it is possible to recognize the pose of humans densely and accurately given a large dataset of poses annotated in detail. In principle, the same approach could be extended to any animal class, but the effort required for collecting new annotations for each case makes this strategy impractical, despite important applications in natural conservation, science and business. We show that, at least for proximal animal classes such as chimpanzees, it is possible to transfer the knowledge existing in dense pose recognition for humans, as well as in more general object detectors and segmenters, to the problem of dense pose recognition in other classes. We do this by (1) establishing a DensePose model for the new animal which is also geometrically aligned to humans, (2) introducing a multi-head R-CNN architecture that facilitates transfer of multiple recognition tasks between classes, (3) finding which combination of known classes can be transferred most effectively to the new animal, and (4) using self-calibrated uncertainty heads to generate pseudo-labels graded by quality for training a model for this class. We also introduce two benchmark datasets labelled in the manner of DensePose for the class chimpanzee and use them to evaluate our approach, showing excellent transfer learning performance.

Use the Force, Luke! Learning to predict physical forces by simulating effects
Kiana Ehsani, Shubham Tulsiani, Saurabh Gupta, Ali Farhadi, Abhinav Gupta

When we humans look at a video of human-object interaction, we can not only infer what is happening but also extract actionable information and imitate those interactions. On the other hand, current recognition or geometric approaches lack the physicality of action representation. In this paper, we take a step towards a more physical understanding of actions. We address the problem of inferring contact points and the physical forces from videos of humans interacting with objects. One of the main challenges in tackling this problem is obtaining ground-truth labels for forces. We sidestep this problem by instead using a physics simulator for supervision. Specifically, we use a simulator to predict effects, and enforce that estimated forces must lead to the same effect as depicted in the video. Our quantitative and qualitative results show that (a) we can predict meaningful forces from videos whose effects lead to accurate imitation of the motions observed, (b) by jointly optimizing for contact point and force prediction, we can improve the performance on both tasks in comparison to independent training, and (c) we can learn a representation from this model that generalizes to novel objects using few-shot examples.

VPLNet: Deep single view normal estimation with vanishing points and lines
Rui Wang, David Geraghty, Kevin Matzen, Jan-Michael Frahm, Richard Szeliski

We present a novel single-view surface normal estimation method that combines traditional line and vanishing point analysis with a deep learning approach. Starting from a color image and a Manhattan line map, we use a deep neural network to regress on a dense normal map, and a dense Manhattan label map that identifies planar regions aligned with the Manhattan directions. We fuse the normal map and label map in a fully differentiable manner to produce a refined normal map as final output. To do so, we softly decompose the output into a Manhattan part and a non-Manhattan part. The Manhattan part is treated by discrete classification and vanishing points, while the non-Manhattan part is learned by direct supervision.

Our method achieves state-of-the-art results on standard single-view normal estimation benchmarks. More importantly, we show that by using vanishing points and lines, our method has better generalization ability than existing works. In addition, we demonstrate how our surface normal network can improve the performance of depth estimation networks, both quantitatively and qualitatively, in particular, in 3D reconstructions of walls and other flat surfaces.

ViBE: Dressing for diverse body shapes
Wei-Lin Hsiao, Kristen Grauman

Body shape plays an important role in determining what garments will best suit a given person, yet today’s clothing recommendation methods take a “one shape fits all” approach. These body-agnostic vision methods and datasets are a barrier to inclusion, ill-equipped to provide good suggestions for diverse body shapes. We introduce ViBE, a VIsual Body-aware Embedding that captures clothing’s affinity with different body shapes. Given an image of a person, the proposed embedding identifies garments that will flatter her specific body shape. We show how to learn the embedding from an online catalog displaying fashion models of various shapes and sizes wearing the products, and we devise a method to explain the algorithm’s suggestions for well-fitting garments. We apply our approach to a dataset of diverse subjects, and demonstrate its strong advantages over status quo body-agnostic recommendation, both according to automated metrics and human opinion.

Video classification with correlation networks
Heng Wang, Du Tran, Lorenzo Torresani, Matt Feiszli

Motion is a salient cue to recognize actions in video. Modern action recognition models leverage motion information either explicitly by using optical flow as input or implicitly by means of 3D convolutional filters that simultaneously capture appearance and motion information. This paper proposes an alternative approach based on a learnable correlation operator that can be used to establish frame-to-frame matches over convolutional feature maps in the different layers of the network. The proposed architecture enables the fusion of this explicit temporal matching information with traditional appearance cues captured by 2D convolution. Our correlation network compares favorably with widely-used 3D CNNs for video modeling, and achieves competitive results over the prominent two-stream network while being much faster to train. We empirically demonstrate that correlation networks produce strong results on a variety of video datasets, and outperform the state of the art on three popular benchmarks for action recognition: Kinetics, Something-Something and Diving48.
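
The frame-to-frame matching at the core of a correlation operator can be sketched in plain Python. This is an illustrative, non-learnable version under stated assumptions: feature maps are nested lists of per-pixel feature vectors, similarity is a dot product, and the search is limited to a small displacement window. The names `correlation`, `best_displacement`, and `max_disp` are hypothetical; the paper's operator additionally has learnable components that are not shown here.

```python
def correlation(feat_a, feat_b, max_disp=1):
    """Dot-product similarity between each cell of feat_a and nearby cells of feat_b.

    feat_a, feat_b: H x W grids (nested lists) of equal-length feature vectors.
    Returns a dict mapping (row, col) -> {(dy, dx): similarity}.
    Displacements that fall outside the grid score 0.0.
    """
    h, w = len(feat_a), len(feat_a[0])
    out = {}
    for y in range(h):
        for x in range(w):
            scores = {}
            for dy in range(-max_disp, max_disp + 1):
                for dx in range(-max_disp, max_disp + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:
                        scores[(dy, dx)] = sum(
                            a * b for a, b in zip(feat_a[y][x], feat_b[yy][xx]))
                    else:
                        scores[(dy, dx)] = 0.0
            out[(y, x)] = scores
    return out


def best_displacement(scores):
    """Displacement with the highest similarity for one cell."""
    return max(scores, key=scores.get)


# Toy usage: frame_b is frame_a shifted one cell to the right.
frame_a = [[[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]]
frame_b = [[[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]]
corr = correlation(frame_a, frame_b, max_disp=1)
shift = best_displacement(corr[(0, 0)])
```

The per-cell score maps act as an explicit motion cue that can be combined with appearance features, in contrast to optical flow computed as a separate input.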

Visual navigation via neural topological mapping
Devendra Singh Chaplot, Ruslan Salakhutdinov, Abhinav Gupta, Saurabh Gupta

This paper studies the problem of image-goal navigation, which involves navigating to the location indicated by a goal image in a novel, previously unseen environment. To tackle this problem, we design topological representations for space that effectively leverage semantics and afford approximate geometric reasoning. At the heart of our representations are nodes with associated semantic features that are interconnected using coarse geometric information. We describe supervised learning-based algorithms that can build, maintain and use such representations under noisy actuation. Experimental study in visually and physically realistic simulation suggests that our method builds effective representations that capture structural regularities and efficiently solve long-horizon navigation problems. We observe a relative improvement of more than 50% over existing methods that study this task.

What makes training multi-modal networks hard?
Weiyao Wang, Du Tran, Matt Feiszli

Consider end-to-end training of a multi-modal vs. a uni-modal network on a task with multiple input modalities: the multi-modal network receives more information, so it should match or outperform its uni-modal counterpart. In our experiments, however, we observe the opposite: the best uni-modal network often outperforms the multi-modal network. This observation is consistent across different combinations of modalities and on different tasks and benchmarks for video classification.

This paper identifies two main causes for this performance drop: first, multi-modal networks are often prone to overfitting due to their increased capacity. Second, different modalities overfit and generalize at different rates, so training them jointly with a single optimization strategy is sub-optimal. We address these two problems with a technique we call Gradient-Blending, which computes an optimal blending of modalities based on their overfitting behaviors. We demonstrate that Gradient-Blending outperforms widely-used baselines for avoiding overfitting and achieves state-of-the-art accuracy on various tasks including human action recognition, ego-centric action recognition, and acoustic event detection.
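
To make the idea of blending by overfitting behavior concrete, here is a minimal sketch. The abstract does not state the weighting formula, so the form used here (each weight proportional to a modality's generalization gain divided by its squared overfitting, then normalized) is an assumption, as are the names `blending_weights` and `blended_loss` and the toy numbers.

```python
def blending_weights(generalization, overfitting):
    """Per-modality loss weights, assumed proportional to G_k / O_k**2.

    generalization: estimated gain in validation performance per modality.
    overfitting: estimated growth of the train/validation gap per modality
                 (must be non-zero).
    The weights are normalised so that they sum to 1.
    """
    raw = [g / (o * o) for g, o in zip(generalization, overfitting)]
    total = sum(raw)
    return [r / total for r in raw]


def blended_loss(losses, weights):
    """Weighted sum of the per-modality losses."""
    return sum(w * l for w, l in zip(weights, losses))


# Toy usage: both modalities generalise equally, but the second overfits twice as fast.
weights = blending_weights(generalization=[0.2, 0.2], overfitting=[1.0, 2.0])
total_loss = blended_loss([1.0, 2.0], weights)
```

In this toy case the faster-overfitting modality receives a quarter of the weight of the other one, so its gradient contributes less to the joint update.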

Wish You Were Here: Context Aware Human Generation
Oran Gafni, Lior Wolf

We present a novel method for inserting objects, specifically humans, into existing images, such that they blend in a photorealistic manner, while respecting the semantic context of the scene. Our method involves three subnetworks: the first generates the semantic map of the new person, given the pose of the other persons in the scene and an optional bounding box specification. The second network renders the pixels of the novel person and its blending mask, based on specifications in the form of multiple appearance components. A third network refines the generated face in order to match that of the target person. Our experiments present convincing high-resolution outputs in this novel and challenging application domain. In addition, the three networks are evaluated individually, demonstrating, for example, state-of-the-art results in pose transfer benchmarks.

X3D: Expanding Architectures for Fast Video Recognition
Christoph Feichtenhofer

This paper presents X3D, a family of efficient video networks built by progressively expanding a tiny 2D image classification architecture along multiple network axes: space, time, width and depth. Inspired by feature selection methods in machine learning, a simple stepwise network expansion approach is employed that expands a single axis in each step, such that a good accuracy-to-complexity tradeoff is achieved. To expand X3D to a specific target complexity, we perform progressive forward expansion followed by backward contraction. Our most surprising finding is that a fast network with high spatiotemporal resolution can perform well, while being extremely light in terms of network width and parameters. We report competitive accuracy at unprecedented efficiency on video classification and detection benchmarks.
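
The stepwise expansion can be read as a greedy search over axes, sketched below. This covers only the forward expansion, not the backward contraction, and it is not the X3D training pipeline: the `score` callback is a hypothetical stand-in for training and validating a candidate network, and the axis names, factors, and toy scoring rule are assumptions.

```python
def expand_once(config, factors, score):
    """Try expanding each axis on its own and keep the single best expansion."""
    best_axis, best_config, best_score = None, None, float("-inf")
    for axis, factor in factors.items():
        candidate = dict(config)
        candidate[axis] = config[axis] * factor
        s = score(candidate)  # stand-in for "train and evaluate this candidate"
        if s > best_score:
            best_axis, best_config, best_score = axis, candidate, s
    return best_axis, best_config


def stepwise_expand(config, factors, score, steps):
    """Repeat single-axis expansion, recording which axis won at each step."""
    history = []
    for _ in range(steps):
        axis, config = expand_once(config, factors, score)
        history.append(axis)
    return config, history


# Toy usage: more frames help until 4, after which extra width helps more.
def toy_score(c):
    return min(c["frames"], 4) + 0.5 * c["width"]


final_config, history = stepwise_expand(
    {"frames": 1, "width": 1}, {"frames": 2, "width": 2}, toy_score, steps=3)
```

Expanding one axis at a time keeps the number of candidates per step equal to the number of axes, which is far cheaper than searching over all axis combinations jointly.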

You2Me: Inferring body pose in egocentric video via first and second person interactions
Evonne Ng, Donglai Xiang, Hanbyul Joo, Kristen Grauman

The body pose of a person wearing a camera is of great interest for applications in augmented reality, healthcare, and robotics, yet much of the person’s body is out of view for a typical wearable camera. We propose a learning-based approach to estimate the camera wearer’s 3D body pose from egocentric video sequences. Our key insight is to leverage interactions with another person—whose body pose we can directly observe—as a signal inherently linked to the body pose of the first-person subject. We show that since interactions between individuals often induce a well-ordered series of back-and-forth responses, it is possible to learn a temporal model of the interlinked poses even though one party is largely out of view. We demonstrate our idea on a variety of domains with dyadic interaction and show the substantial impact on egocentric body pose estimation, which improves the state of the art.