A new dataset to train and benchmark AI systems to better understand actions in videos — in particular, actions that can’t be determined by viewing just a single frame. Current video datasets often focus on actions where a single image is enough for recognition, such as washing dishes, eating pizza, or playing guitar. To improve computer vision systems’ understanding of elements that can be recognized only in a video sequence — such as whether someone is sneezing or opening a door — we discovered a set of actions where temporal information is essential for recognition.
We’re now sharing this work, along with our methodology for determining those classes and results from training networks on it, in order to help researchers benchmark their systems’ ability to recognize temporal actions.
To discover which actions in video should be designated as temporal classes, we presented annotators with video clips from existing video recognition datasets, with their frames shuffled out of order. If annotators couldn’t identify a given action, we determined that temporal information was essential to recognition and added that class to our dataset. We discovered 50 such temporal action classes in all, which were associated with a total of 35,504 publicly available videos from the Kinetics and Something-Something benchmarks. Our resulting list of classes, called the Temporal dataset, doesn’t consist of video content; rather, it consists of the classes associated with specific clips in those benchmarks.
To evaluate the utility of our dataset, we used it to benchmark current video recognition methods and found that some state-of-the-art networks capture more image information than temporal information. We also used this new Temporal dataset to train existing video recognition networks and found that they became more sensitive to temporal changes and less reliant on image information, which improved their ability to generalize to new, unseen action classes.
Our training results indicate that incorporating temporal data can improve the overall performance of video understanding systems. But our work also suggests that current video datasets underrepresent classes where temporal information is essential for understanding, which could bias progress toward understanding images, rather than understanding the kinds of actions that are recognizable only in videos. Our dataset — which isn’t a separate, downloadable resource and can be found in the appendix of the paper below — will help researchers assess and improve their systems’ ability to use temporal information, while also pushing the field to incorporate that information into future video datasets.