May 18, 2021
PyTorchVideo is a deep learning library for research and applications in video understanding. It provides easy-to-use, efficient, and reproducible implementations of state-of-the-art video models, data sets, transforms, and tools in PyTorch.
The PyTorchVideo library supports components that can be used for a variety of video understanding tasks, such as video classification, detection, self-supervised learning, and optical flow. More importantly, it is not limited to visual signals: PyTorchVideo also supports other modalities, including audio and text. Furthermore, PyTorchVideo is not limited to desktop devices: The Accelerator package provides mobile hardware–specific optimizations and model deployment flow, pushing the boundaries for on-device performance.
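As a brief illustration of the high-level API, here is a minimal sketch of running a pretrained video classification model from the PyTorchVideo model zoo through Torch Hub (the "slow_r50" entry point, input clip shape, and class count are assumptions based on the public model zoo, not details stated in this post):

```python
import torch

# Load a pretrained video classification model from the PyTorchVideo model zoo
# through Torch Hub ("slow_r50" is assumed to be an available zoo entry point).
model = torch.hub.load(
    "facebookresearch/pytorchvideo", model="slow_r50", pretrained=True
)
model = model.eval()

# Run inference on a dummy clip shaped (batch, channels, frames, height, width).
clip = torch.randn(1, 3, 8, 256, 256)
with torch.no_grad():
    logits = model(clip)  # class scores over the Kinetics-400 categories
print(logits.shape)  # torch.Size([1, 400])
```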
Features that allow PyTorchVideo to accelerate a project include:
A suite of state-of-the-art video models and their pretrained weights with customizable components that enable researchers to build new video architectures.
A set of downstream tasks including action classification, acoustic event detection, action detection, and self-supervised learning (SSL).
Support for a wide variety of datasets and tasks for benchmarking video models under different evaluation protocols (see the data pipeline sketch after this list).
Efficient building blocks and deployment flow optimized for inference on hardware (mobile device, Intel NNPI, etc.), enabling hardware-aware model design and full-speed on-device model execution.
A growing toolkit of common scripts for video processing, including decoding, tracking, and optical flow extraction.
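As one illustration of how these components compose, below is a minimal sketch of a training data pipeline built from the library's dataset and transform utilities; the data path and the two-second clip duration are placeholder values, and the exact transform settings are assumptions rather than prescribed defaults:

```python
import torch
from pytorchvideo.data import Kinetics, make_clip_sampler
from pytorchvideo.transforms import (
    ApplyTransformToKey,
    Normalize,
    ShortSideScale,
    UniformTemporalSubsample,
)
from torchvision.transforms import CenterCrop, Compose, Lambda

# Clip-level transform applied to the decoded "video" tensor (C x T x H x W).
transform = ApplyTransformToKey(
    key="video",
    transform=Compose(
        [
            UniformTemporalSubsample(8),   # keep 8 frames per clip
            Lambda(lambda x: x / 255.0),   # scale pixel values to [0, 1]
            Normalize((0.45, 0.45, 0.45), (0.225, 0.225, 0.225)),
            ShortSideScale(256),           # resize the short side to 256 pixels
            CenterCrop(256),
        ]
    ),
)

# Kinetics-style dataset: a directory of videos grouped into per-class folders.
# "path/to/kinetics/train" is a placeholder path.
dataset = Kinetics(
    data_path="path/to/kinetics/train",
    clip_sampler=make_clip_sampler("random", 2.0),  # sample random 2-second clips
    transform=transform,
    decode_audio=False,
)

loader = torch.utils.data.DataLoader(dataset, batch_size=4)
```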
Going forward, we are committed to continuing to enhance the PyTorchVideo library to enable and support more groundbreaking research in video understanding. We welcome contributions from the entire community. All our efforts will be directed at supporting the rich open source community committed to pushing the boundaries of video research.
Understanding video is one of the grand challenges of computer vision. Increases in computational resources and the amount of video data on the web are leading to more advances in the field. However, the scale, richness, and difficulty of analyzing video data mean there is a strong demand for effective and efficient cutting-edge models, infrastructure, and tools for video understanding.
PyTorchVideo aims to meet that demand by providing a unified repository of reproducible and efficient video understanding components that are readily available for centralized use in research and production applications.
Another major challenge is the lack of a standardized, video-focused library that serves a variety of video use cases in one place. This has created a barrier to entry for developers looking to work with videos for the first time. Lack of standardization also makes it difficult to collaborate and to build upon others’ work. In this regard, PyTorchVideo is our sincere effort to address some of these bottlenecks.
At Facebook, PyTorchVideo supports state-of-the-art research from FAIR, such as:
X3D: Expanding architectures for efficient video recognition
A closer look at spatiotemporal convolutions for action recognition
Video classification with channel-separated convolutional networks
It has also been used to power recent advances in video transformers and self-supervised learning, such as:
A large-scale study on unsupervised spatiotemporal representation learning
Multiview pseudo-labeling for semi-supervised learning from video
Unidentified video objects: A benchmark for dense, open-world segmentation
Is space-time attention all you need for video understanding?
PyTorchVideo is supported and developed by the following contributors: Tullie Murrell, Haoqi Fan, Kalyan Vasudev Alwala, Yilei Li, Yanghao Li, Heng Wang, Bo Xiong, Nikhila Ravi, Matt Feiszli, Aaron Adcock, Wan-Yen Lo, Jitendra Malik, Ross Girshick, and Christoph Feichtenhofer.