The PE Video Dataset (PVD) is a large-scale collection of 1 million diverse videos, featuring 120,000+ expertly annotated clips. The dataset was introduced in our paper "Perception Encoder".
The PE Video Dataset (PVD) comprises 1M high-quality and diverse videos. Among them, 120K videos carry automated, human-verified annotations, and every video is accompanied by a video description and keywords. The videos are motion-centered, covering both first-person and third-person views across a wide range of scenes.
Computer Vision, Video Understanding
Train and evaluate video retrieval models
Train and evaluate video captioning models
Videos
Video caption (Human annotated / Model generated)
Training, Testing
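For the retrieval use case above, evaluation typically pairs each caption with its video and measures recall@k. A minimal sketch with toy embeddings (the embedding model itself is out of scope here; all names and data below are illustrative, not part of the dataset):

```python
import numpy as np

def recall_at_k(video_embs: np.ndarray, text_embs: np.ndarray, k: int = 1) -> float:
    """Text-to-video retrieval recall@k: for each caption embedding, check
    whether its paired video (same row index) is among the k most similar
    videos by cosine similarity."""
    # Normalize rows so the dot product equals cosine similarity.
    v = video_embs / np.linalg.norm(video_embs, axis=1, keepdims=True)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = t @ v.T                               # (num_texts, num_videos)
    topk = np.argsort(-sims, axis=1)[:, :k]      # top-k video indices per caption
    hits = (topk == np.arange(len(t))[:, None]).any(axis=1)
    return float(hits.mean())

# Toy example: 3 paired (video, caption) embeddings, captions close to their videos.
rng = np.random.default_rng(0)
videos = rng.normal(size=(3, 8))
texts = videos + 0.01 * rng.normal(size=(3, 8))
print(recall_at_k(videos, texts, k=1))
```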
Total number of videos: 998,862
Total number of human-annotated captions: 118,862
Average FPS: 29.8
Average video length: 16.7 s
Average video height: 346 px
Average video width: 604 px
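Summary statistics like those above can be recomputed from per-video metadata. A minimal sketch; the field names (`fps`, `duration_s`, `height`, `width`) and the sample records are hypothetical, not the dataset's actual schema:

```python
from statistics import mean

# Hypothetical per-video metadata records; field names are illustrative only.
videos = [
    {"fps": 30.0, "duration_s": 15.2, "height": 360, "width": 640},
    {"fps": 29.97, "duration_s": 18.1, "height": 340, "width": 600},
    {"fps": 29.5, "duration_s": 16.9, "height": 338, "width": 572},
]

# Aggregate the same four statistics reported in the dataset card.
stats = {
    "avg_fps": mean(v["fps"] for v in videos),
    "avg_length_s": mean(v["duration_s"] for v in videos),
    "avg_height": mean(v["height"] for v in videos),
    "avg_width": mean(v["width"] for v in videos),
}
print(stats)
```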
A text description that summarizes what is happening in the video, such as the actions, events, or objects shown.
We selected videos from 10 different categories, including hand actions, object interactions, food preparation, work activities, outdoor scenes, animals, water scenes, object handling, close-up shots, and nature scenes.
CC BY-NC 4.0
Open access
The video captions were refined according to the following criteria: annotators remove any hallucinations found in the model-generated caption, correct words that describe the video inaccurately, and eliminate repeated or redundant words to keep the caption concise and accurate. Additionally, if major actions are missing from the caption, annotators add them in a concise and natural way.
All 118,862 human-annotated captions were reviewed by human annotators.