October 13, 2021
For applications ranging from self-driving cars to augmented reality, it is important that artificial intelligence systems can anticipate people’s future actions. As someone builds an IKEA dresser, they may find themselves wondering whether the next step is to attach the legs or the drawers. A friend could helpfully suggest the correct part to add, based on the steps followed so far. But this type of anticipation is a challenging task for AI, one that requires both predicting the multimodal distribution of future activities and modeling the progression of past actions.
To address this important challenge, we’ve leveraged recent developments in Transformer architectures, especially for natural language processing and image modeling, to build Anticipative Video Transformer (AVT), an end-to-end attention-based model for action anticipation in videos. Compared with previous approaches, it’s better at understanding long-range dependencies, like how someone’s past cooking steps indicate what they will do next.
AVT could be especially useful for applications such as an AR “action coach” or an AI assistant, by prompting someone that they may be about to make a mistake in completing a task or by reacting ahead of time with a helpful prompt for the next step in a task. For example, AVT could warn someone that the pan they’re about to pick up is hot, based on the person’s previous interactions with the pan.
We’re confident that AVT can quickly advance action anticipation performance across applications like these and others. Our model outperforms existing state-of-the-art architectures on popular benchmarks — EGTEA Gaze+, 50 Salads, and EPIC-Kitchens-55 — and, most notably, it won the Action Anticipation challenge in the EPIC-Kitchens 2021 competition. (The EPIC-Kitchens-55 data set is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License.)
Most prior approaches to action anticipation struggle to model long-range sequential dependencies. For example, predicting the next action of someone making an omelette — chopping onions or heating a pan — depends on the sequence of actions they’ve already performed.
But AVT is attention based, so it can process a full sequence in parallel. By comparison, recurrent neural network–based approaches often forget the past, since they process frames one at a time and must compress all prior history into a fixed-size hidden state. AVT also features loss functions that encourage the model to capture the sequential nature of video, which would otherwise be lost by attention-based architectures such as nonlocal networks.
AVT consists of two parts: an attention-based backbone (AVT-b) that operates on frames of video and an attention-based head architecture (AVT-h) that operates on features extracted by the backbone. Our best action anticipation came from training the full architecture end to end, but AVT-h is also compatible with standard video backbones like 3D convolutional networks. That’s important because learning better video backbones is an active research area and we want AVT to be useful with the latest and greatest video backbones, such as Multiscale Vision Transformers.
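The backbone-plus-head split described above can be pictured with a toy sketch. The functions below are hypothetical stand-ins, not the real AVT modules: the "backbone" just pools pixels into a feature vector, and the "head" produces one output per frame using only that frame and the frames before it.

```python
import numpy as np

rng = np.random.default_rng(0)

def backbone(frame):
    # Hypothetical stand-in for AVT-b (or any swappable video backbone):
    # maps one H x W x 3 frame to a feature vector (here, D = 3 via pooling).
    return frame.mean(axis=(0, 1))

def head(frame_features):
    # Hypothetical stand-in for AVT-h: emits one output per frame, where
    # frame t's output depends only on features up to t (the causal property).
    T = len(frame_features)
    return np.array([frame_features[: t + 1].mean(axis=0) for t in range(T)])

frames = rng.random((4, 8, 8, 3))                # T = 4 frames of 8x8 RGB
feats = np.stack([backbone(f) for f in frames])  # (4, 3) per-frame features
preds = head(feats)                              # (4, 3) causal per-frame outputs
```

Because the head only consumes per-frame feature vectors, any backbone with the right output shape can be dropped in — which is the compatibility point made above.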
The AVT-b backbone is based on the Vision Transformer (ViT) architecture. It splits each frame into non-overlapping patches, embeds them with a feedforward network, adds a special classification token, and applies multiple layers of multihead self-attention. The weights are shared across frames, and the features corresponding to the classification token are passed to the head.
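The patch-splitting step can be sketched in a few lines. This is a minimal illustration of the ViT-style tokenization described above, assuming a toy 8×8 frame, 4×4 patches, and a random linear embedding in place of the learned one:

```python
import numpy as np

def patchify(frame, patch=4):
    # Split an H x W x C frame into non-overlapping patch x patch tiles,
    # each flattened to a vector -- the first step of a ViT-style backbone.
    H, W, C = frame.shape
    tiles = frame.reshape(H // patch, patch, W // patch, patch, C)
    tiles = tiles.transpose(0, 2, 1, 3, 4)
    return tiles.reshape(-1, patch * patch * C)

rng = np.random.default_rng(0)
frame = rng.random((8, 8, 3))
patches = patchify(frame)                # (4, 48): a 2x2 grid of 4x4x3 patches
W_embed = rng.random((48, 16))           # toy stand-in for the learned embedding, D = 16
tokens = patches @ W_embed               # (4, 16) patch tokens
cls = np.zeros((1, 16))                  # the special classification token
tokens = np.concatenate([cls, tokens], axis=0)  # (5, 16) tokens fed to self-attention
```

In the real model, self-attention layers then mix these tokens, and only the classification token's output is handed to the head.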
The head architecture takes the per-frame features and applies another Transformer architecture with causal attention. This means that it evaluates features only from the current and preceding frames. This in turn allows the model to rely solely on past features when generating a representation of any individual frame. That’s crucial for prediction.
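Causal attention boils down to masking out future positions before the softmax. Here is a self-contained, single-head sketch (a simplification of the multihead attention AVT actually uses, with the inputs doubling as queries, keys, and values):

```python
import numpy as np

def causal_attention(x):
    # Single-head self-attention with a causal mask: position t can attend
    # only to positions <= t, so each frame's output depends solely on the
    # current and preceding frames -- never on the future.
    T, D = x.shape
    scores = x @ x.T / np.sqrt(D)                   # (T, T) pairwise similarities
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores[mask] = -np.inf                          # hide future positions
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)   # row-wise softmax
    return weights @ x

x = np.random.default_rng(0).random((5, 8))
out = causal_attention(x)
```

A quick way to see the causality: the first frame's output equals its input (there is nothing earlier to attend to), and perturbing the last frame leaves every earlier output unchanged.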
For example, in the video above, the model first encodes the visual features from the tap being turned on, moves on to each tomato being washed, and finally predicts that the next action will be turning off the tap.
We train the model to predict future actions and features using three losses. First, we classify the features of the last frame of a video clip to predict the labeled future action; second, we regress each intermediate frame’s feature to the features of the succeeding frame, which trains the model to predict what comes next; third, we train the model to classify the intermediate actions. We’ve shown that by jointly optimizing the three losses, our model predicts future actions 10 percent to 30 percent better than models trained only with bidirectional attention. These additional losses provide extra supervision that makes AVT better suited for long-range reasoning, and we found that its performance improves as it incorporates longer context.
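The three-part objective can be sketched as follows. This is a hedged toy version, assuming unit weights on each term and simple cross-entropy and mean-squared-error losses; the variable names (`pred_feats`, `true_feats`, etc.) are illustrative, not AVT's actual API:

```python
import numpy as np

def softmax_xent(logits, label):
    # Cross-entropy for a single example (toy helper).
    z = logits - logits.max()
    return -(z[label] - np.log(np.exp(z).sum()))

def combined_loss(pred_feats, true_feats, logits, labels):
    # Toy three-part anticipation objective, assuming:
    #   pred_feats[t] : the model's predicted feature for frame t+1
    #   true_feats[t] : the actual feature of frame t
    #   logits[t]     : action logits computed from frame t's output
    #   labels        : per-frame action labels; labels[-1] is the future action
    T = len(true_feats)
    # 1) classify the last frame's output as the labeled future action
    next_action = softmax_xent(logits[-1], labels[-1])
    # 2) regress each intermediate predicted feature onto its successor
    feat_reg = np.mean((pred_feats[:-1] - true_feats[1:]) ** 2)
    # 3) classify the intermediate actions
    intermediate = np.mean([softmax_xent(logits[t], labels[t]) for t in range(T - 1)])
    return next_action + feat_reg + intermediate

rng = np.random.default_rng(0)
T, D, A = 4, 8, 5                       # frames, feature dim, action classes
true_feats = rng.random((T, D))
pred_feats = rng.random((T, D))
logits = rng.random((T, A))
labels = rng.integers(0, A, size=T)
loss = combined_loss(pred_feats, true_feats, logits, labels)
```

In practice the terms would be weighted and averaged over a batch; the point here is only that one scalar jointly supervises future-action classification, feature prediction, and intermediate-action recognition.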
People make countless decisions every day based on their understanding of the world around them not just as a static, fixed set of inputs, but as a connected sequence of events. AI models offer great promise to help with many tasks with and for people, but in order to maximize this potential, they need this anticipative ability too. AVT is an important step in this direction.
Because it’s built on top of the causal decoder architecture, AVT can be easily rolled out autoregressively to predict longer into the future, anticipating not just the next action but also several sequential actions the user might do. That could one day prove useful for long-term planning tasks, such as AR glasses observing that the person wearing them is changing a flat tire. The system could anticipate the required series of steps for that task and prompt the wearer to select the specific tools they need, even a few steps into the future, when they walk over to their toolshed to get them.
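The autoregressive rollout works by feeding each predicted future feature back into the model as if it were an observed frame. A minimal sketch, with a hypothetical `predict_next` standing in for the trained causal decoder:

```python
import numpy as np

def rollout(feats, predict_next, steps=3):
    # Autoregressive rollout: predict the next feature, append it to the
    # sequence, and repeat -- anticipating several steps into the future.
    feats = list(feats)
    future = []
    for _ in range(steps):
        nxt = predict_next(np.stack(feats))  # condition on observed + predicted
        future.append(nxt)
        feats.append(nxt)
    return future

# Toy stand-in "model": predicts the mean of the features seen so far.
predict_next = lambda f: f.mean(axis=0)
seq = np.random.default_rng(0).random((4, 8))   # 4 observed frame features
future = rollout(seq, predict_next, steps=3)    # 3 anticipated feature vectors
```

Each predicted feature could then be decoded into an action label, yielding a forecast of the next several actions rather than just one.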
Looking ahead, we believe AVT could be helpful for tasks beyond anticipation, such as self-supervised learning, the discovery of action schemas and boundaries, and even for general action recognition in tasks that require modeling the chronological sequence of actions. These are some of the areas we’re excited to explore in future work.