August 11th, 2021
The idea of hierarchical visual representations in the human brain is quite well established. In the early 1960s, D.H. Hubel and T.N. Wiesel developed a hierarchical model of the visual pathway in which neurons in lower areas of the brain, like the primary visual cortex, responded to features like oriented edges and bars, while neurons in higher areas responded to more specific stimuli. About two decades later, in 1980, Kunihiko Fukushima proposed the Neocognitron, a neural network architecture for pattern recognition that was explicitly inspired by Hubel and Wiesel’s hierarchy. This central theme remains evident to this day in convolutional neural networks, which build multiscale hierarchical representations of the input.
Facebook AI has built Multiscale Vision Transformers (MViT), a Transformer architecture for representation learning from visual data such as images and videos. It’s a family of visual recognition models that incorporate the seminal concept of hierarchical representations into the powerful Transformer architecture. MViT is the first such system that can train entirely from scratch on a video recognition data set (like Kinetics-400) and achieve state-of-the-art performance across a variety of transfer learning tasks, like video classification and human action localization.
When presented with an image or video, MViT models identify the objects present in the image or the actions being performed in the video. The trained models perform competitively on the Kinetics and ImageNet classification data sets and transfer well to downstream tasks such as action recognition on data sets like Charades, Something-Something, and Atomic Visual Actions (AVA). In the future, the application of MViT to videos and images in the wild may help contribute to machines that are better at analyzing uncurated sights of the real world, not just the elements of far smaller, hand-curated data sets.
The central advance of MViT is a spatiotemporal feature hierarchy built into the Transformer backbone. Typical Vision Transformer models keep a constant resolution and feature dimension throughout all layers and use an attention mechanism to determine which tokens each token should focus on. In MViT, we replace that with a pooling attention mechanism that pools the projected query, key, and value vectors, which makes it possible to reduce the visual resolution from layer to layer. We couple this with an increasing channel dimension to construct a hierarchy that runs from simple features with high visual resolution to more complex, high-dimensional features with low resolution.
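To make the pooling attention idea concrete, here is a minimal single-head sketch in plain Python. It is an illustration under simplifying assumptions, not the MViT implementation: tokens form a 1D sequence rather than a space-time grid, pooling is a non-overlapping average, the query, key, and value inputs are assumed to be already projected, and the names avg_pool, pooling_attention, q_stride, and kv_stride are hypothetical.

```python
import math

def avg_pool(tokens, stride):
    """Average non-overlapping groups of `stride` consecutive token vectors.

    Trailing tokens that do not fill a whole group are dropped.
    """
    pooled = []
    for start in range(0, len(tokens) - stride + 1, stride):
        group = tokens[start:start + stride]
        dim = len(group[0])
        pooled.append([sum(vec[d] for vec in group) / stride for d in range(dim)])
    return pooled

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def pooling_attention(q, k, v, q_stride, kv_stride):
    """Single-head scaled dot-product attention over pooled q, k, v sequences."""
    q_p = avg_pool(q, q_stride)    # shortens the output sequence
    k_p = avg_pool(k, kv_stride)   # shortens the sequence attended over
    v_p = avg_pool(v, kv_stride)
    scale = 1.0 / math.sqrt(len(q_p[0]))
    dim_v = len(v_p[0])
    out = []
    for query in q_p:
        scores = [scale * sum(a * b for a, b in zip(query, key)) for key in k_p]
        weights = softmax(scores)
        out.append([sum(w * val[d] for w, val in zip(weights, v_p))
                    for d in range(dim_v)])
    return out

# Toy input: 8 tokens with 4 channels each, standing in for projected q, k, v.
tokens = [[float(i + d) for d in range(4)] for i in range(8)]
out = pooling_attention(tokens, tokens, tokens, q_stride=2, kv_stride=4)
print(len(tokens), "->", len(out))  # output length follows the pooled queries
```

In this sketch the output has one token per pooled query, so the query stride sets the resolution handed to the next layer, while pooling the keys and values only shortens the sequence being attended over. The accompanying increase in channel dimension is not shown here.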
MViT marks a significant improvement over prior attempts at video understanding with Transformers, which require computationally expensive pretraining on massive data sets (such as ImageNet-21K) and are extremely parameter-dense, requiring multistep training schemes. In contrast, MViT trains from scratch in a single step with no external pretraining. It also significantly improves state-of-the-art performance across well-studied recognition benchmarks such as ImageNet, Kinetics-400, Kinetics-600, and AVA.
Further, MViT models demonstrate superior understanding of temporal cues without getting bogged down in spurious spatial biases, a common pitfall of prior methods. Though much more work is needed, the advances enabled by MViT could significantly improve detailed human action understanding, which is a crucial component in real-world AI applications such as robotics and autonomous vehicles. In addition, innovations in video recognition architectures are an essential component of robust, safe, and human-centric AI.
We are grateful for discussions with Chao-Yuan Wu, Ross Girshick, and Kaiming He and to Chen Wei for help with visualizations.