Computer Vision

TimeSformer: A new architecture for video understanding

March 12, 2021

What the research is:

Facebook AI has built and is now sharing details about TimeSformer, an entirely new architecture for video understanding. It is the first video architecture that’s based purely on Transformers, which in recent years have become the dominant approach for many applications in natural language processing (NLP), including machine translation and general language understanding.

TimeSformer (from Time-Space Transformer) achieves the best reported numbers on several challenging action recognition benchmarks, including the Kinetics-400 action recognition data set. Furthermore, compared with modern 3D convolutional neural networks (CNNs), TimeSformer is roughly three times faster to train and requires less than one-tenth the amount of compute for inference. This is an important step toward supporting applications requiring real-time or on-demand processing of video.

Additionally, the scalability of TimeSformer enables the training of much larger models on much longer video clips. This opens the door to AI systems that can understand more complex human actions in videos, such as activities involving multiple atomic steps (e.g., repairing a car or preparing a meal). This could be beneficial for many AI applications that require an understanding of complex human behaviors.

Video classification accuracy of TimeSformer versus state-of-the-art 3D convolutional neural networks on the action recognition benchmarks of Kinetics-400 (left) and Kinetics-600 (right). TimeSformer achieves the best reported accuracy on both data sets.

How it works:

Traditional video classification models leverage 3D convolutional filters. While such filters are effective at capturing short-range patterns within local spatiotemporal regions, they simply cannot model space-time dependencies that extend beyond their small receptive fields.

TimeSformer, however, is built exclusively on the self-attention mechanism used in Transformer models, which makes it possible to capture space-time dependencies over the entire video. In order to apply Transformers to video, our model interprets the input video as a time-space sequence of image patches extracted from the individual frames. This format is akin to that used in NLP, where Transformers view sentences as sequences of feature vectors computed from the individual words. Just as NLP Transformers infer the meaning of each word by comparing it with all the other words in the sentence — a procedure known as self-attention — our model captures the semantics of each patch by explicitly comparing it with the other patches in the video. This makes it possible to capture short-term dependencies between neighboring patches as well as long-range correlations between distant patches.
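The patch-sequence interpretation described above can be sketched in a few lines. This is an illustrative NumPy example (not the actual TimeSformer code, which also applies a learned linear embedding and positional encodings to each patch vector); the function name and shapes are our own choices for illustration.

```python
import numpy as np

def video_to_patch_sequence(video, patch_size):
    """Split a video of shape (T, H, W, C) into a flat sequence of
    non-overlapping patch vectors, one per (frame, row, col) location.
    Returns an array of shape (T * H/P * W/P, P*P*C)."""
    T, H, W, C = video.shape
    P = patch_size
    assert H % P == 0 and W % P == 0, "frame size must be divisible by patch size"
    # Carve each frame into an (H/P x W/P) grid of PxPxC patches.
    patches = video.reshape(T, H // P, P, W // P, P, C)
    patches = patches.transpose(0, 1, 3, 2, 4, 5)  # (T, H/P, W/P, P, P, C)
    return patches.reshape(T * (H // P) * (W // P), P * P * C)

# Toy video: 8 frames of 32x32 RGB split into 16x16 patches -> 8*2*2 = 32 tokens.
video = np.random.rand(8, 32, 32, 3)
seq = video_to_patch_sequence(video, 16)
print(seq.shape)  # (32, 768)
```

Each row of the resulting sequence plays the role that a word embedding plays in an NLP Transformer: a token the model can compare against all the others via self-attention.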

Traditional 3D convolutional neural networks also have high computational cost, as they require sliding a large set of filters over all space-time locations of the video. TimeSformer maintains a low computational cost by 1) decomposing the video into a small set of non-overlapping patches, and 2) applying a form of self-attention that avoids exhaustive comparison between all pairs of patches. We call this scheme divided space-time attention. The idea is to separately apply temporal attention and spatial attention, one after the other.

When temporal attention is used, each patch (e.g., the square colored in blue in the figure below) is compared only with patches at the same spatial location in the other frames (green-colored squares). If the video contains T frames, only T temporal comparisons are made for each patch. When spatial attention is applied, the patch is compared only with patches within the same frame (red-colored patches). Thus, if N is the number of patches in each frame, divided space-time attention performs in total only (T+N) comparisons per patch, versus the (T*N) comparisons needed by the exhaustive method of joint space-time attention. Furthermore, we found that divided space-time attention is not only more efficient but also more accurate than joint space-time attention.
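The two-step scheme above can be sketched with plain NumPy. This is a simplified illustration, not the actual implementation: TimeSformer uses separate learned query/key/value projections, multiple heads, residual connections, and layer normalization for each step, all of which are omitted here.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Standard scaled dot-product attention over the second-to-last axis.
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def divided_space_time_attention(x):
    """x: (T, N, D) token grid (T frames, N patches per frame, D dims).
    Temporal attention first: each patch attends across the T frames at
    its own spatial location. Then spatial attention: each patch attends
    across the N patches within its own frame."""
    xt = np.swapaxes(x, 0, 1)        # (N, T, D): attended axis = frames
    xt = attention(xt, xt, xt)       # T comparisons per patch
    x = np.swapaxes(xt, 0, 1)        # back to (T, N, D)
    return attention(x, x, x)        # N comparisons per patch

T, N, D = 8, 196, 64
tokens = np.random.rand(T, N, D)
out = divided_space_time_attention(tokens)
print(out.shape)  # (8, 196, 64)
```

Note how each patch participates in T comparisons in the first step and N in the second, giving the (T+N) per-patch cost described above, rather than the (T*N) cost of attending jointly over all space-time locations.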

The scalability of TimeSformer allows it to operate on extremely long clips (e.g., a sequence of 96 frames spanning a temporal extent of 102 seconds) in order to perform super-long-range temporal modeling. This represents a significant departure from current 3D CNNs, which are limited to processing clips of at most a handful of seconds, and is a critical requirement for the recognition of long-form activities. Consider, for example, a video demonstrating how to make french toast. An AI model analyzing a handful of seconds at a time may recognize some of the atomic actions (e.g., beating the eggs or pouring milk into a bowl). But classifying each individual action is not sufficient to classify the complex activity (many recipes involve egg beating). TimeSformer can analyze the video over much longer temporal extents, which reveal disambiguating dependencies among the atomic actions (e.g., combining milk with beaten eggs).

The efficiency of TimeSformer makes it possible to train models at high spatial resolution (e.g., frames of up to 560x560 pixels) and over long videos (including up to 96 frames). These plots show video classification cost (in TFLOPs) as a function of spatial resolution (left) and video length (right). From these plots, we can observe the dramatic computational savings provided by divided space-time attention over exhaustive joint space-time attention, especially when applied to large frames or long videos. In practice, joint space-time attention causes a GPU memory overflow once the spatial frame resolution reaches 448 pixels or the number of frames is increased to 32, effectively making it inapplicable to large frames or long videos.
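The scaling gap behind these plots follows directly from the per-patch comparison counts. A back-of-the-envelope calculation (our own illustration; the helper names and the 16x16-patch assumption are not from the post):

```python
# Per-layer pairwise attention comparisons for a clip with
# T frames and N patches per frame.
def joint_comparisons(T, N):
    return (T * N) ** 2      # every token attends to every other token

def divided_comparisons(T, N):
    return T * N * (T + N)   # T temporal + N spatial comparisons per token

# Assuming 224x224 frames split into 16x16 patches -> N = 196.
N = (224 // 16) ** 2
for T in (8, 32, 96):
    ratio = joint_comparisons(T, N) / divided_comparisons(T, N)
    print(f"T={T}: joint is {ratio:.1f}x more comparisons than divided")
```

The ratio is T*N/(T+N), so it grows with both clip length and frame resolution, which is why joint attention runs out of memory first on large frames and long clips.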

The figure provides a visualization of the self-attention heatmaps learned by TimeSformer. The first row shows the original frames, while the second row weights the color of each pixel by the importance given by self-attention for the classification of the video (pixels deemed unimportant become dark). As shown, TimeSformer learns to attend to the relevant regions in the video in order to perform complex spatiotemporal reasoning.

Why it matters:

To train video-understanding models, the best 3D CNNs today can only use video segments that are a few seconds long. With TimeSformer, we are able to train on far longer video clips — up to several minutes long. This may dramatically advance research to teach machines to understand complex long-form actions in videos, which is an important step for many AI applications geared toward human behavior understanding (e.g., an AI assistant).

Furthermore, the low inference cost of TimeSformer is an important step toward supporting future real-time video processing applications, such as AR/VR, or intelligent assistants that provide services based on video taken from wearable cameras. We also believe that the reduced cost of our approach will enable more researchers to tackle video analysis problems, thus expediting progress in this area.

Finally, we hope that the strong performance achieved by TimeSformer will lead the research field to embrace this new promising approach to video modeling.


Is space-time attention all you need for video understanding?

Written By

Gedas Bertasius

Postdoctoral Researcher

Heng Wang

Applied Research Scientist

Lorenzo Torresani

Research Scientist