APRIL 17, 2025

PE Video Dataset (PVD)

The PE Video Dataset (PVD) is a large-scale collection of 1 million diverse videos, featuring nearly 120,000 expertly annotated clips. The dataset was introduced in our paper "Perception Encoder".

Overview

PE Video Dataset (PVD) comprises 1M high-quality, diverse videos. Of these, nearly 120K are accompanied by automated and human-verified annotations, and every video comes with a description and keywords. The videos are motion-centered, covering both first-person and third-person views across a wide range of scenes.

Key Application

Computer Vision, Video Understanding

Intended Use Cases

  • Train and evaluate video retrieval models

  • Train and evaluate video captioning models
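
For example, a retrieval or captioning pipeline typically starts by iterating over video–caption pairs. Below is a minimal loading sketch, assuming the dataset is distributed via the Hugging Face Hub; the dataset ID "facebook/pe-video" and the field names are illustrative assumptions, not confirmed by this card.

```python
from datasets import load_dataset

# Hypothetical dataset ID; substitute the actual Hub location of PVD.
pvd = load_dataset("facebook/pe-video", split="train")

# Inspect a few video–caption pairs (field names are assumptions).
for example in pvd.select(range(3)):
    print(example["video"], example["caption"])
```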

Primary Data Type

  • Videos

  • Video captions (human-annotated / model-generated)

Data Function

Training, Testing

Dataset Characteristics

  • Total number of videos: 998,862

  • Total number of human-annotated captions: 118,862

  • Average FPS: 29.8

  • Average video length: 16.7s

  • Average video height: 346 px

  • Average video width: 604 px
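
As a rough illustration of how per-video statistics like these (FPS, length, resolution) can be computed, here is a minimal sketch using OpenCV; this is not the authors' pipeline, just an example under assumed inputs.

```python
import cv2

def video_stats(path: str) -> dict:
    """Read basic properties of a single video file."""
    cap = cv2.VideoCapture(path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    frame_count = cap.get(cv2.CAP_PROP_FRAME_COUNT)
    height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    cap.release()
    return {
        "fps": fps,
        "length_s": frame_count / fps if fps else 0.0,  # duration = frames / FPS
        "height": height,
        "width": width,
    }

print(video_stats("example.mp4"))  # "example.mp4" is a placeholder path
```

Averaging these per-video dictionaries over the whole collection yields dataset-level figures like the ones reported above.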

Labels

Each label is a text description that summarizes the content of a video: what is happening in it, such as the actions, events, or objects shown.
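
To make the annotation format concrete, here is a hypothetical record pairing a video with its description and keywords; the field names and values are illustrative assumptions, not a documented schema.

```python
# Hypothetical PVD annotation record (field names are assumptions).
example_record = {
    "video": "clips/000123.mp4",  # path to the video file
    "caption": (
        "A person chops vegetables on a wooden board, "
        "then slides them into a pan."
    ),
    "keywords": ["food preparation", "hands", "kitchen"],
}
```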

Nature Of Content

We selected videos from 10 categories: hand actions, object interactions, food preparation, work activities, outdoor scenes, animals, water scenes, object handling, close-up shots, and nature scenes.

License

CC BY-NC 4.0

Access Cost

Open access

Labeling Methods

The video captions are refined according to the following criteria: annotators remove any hallucinations found in the model-generated caption, correct words that describe the video inaccurately, and eliminate repeated or redundant words to keep the caption concise and accurate. If major actions are missing from the caption, annotators add them in a concise, natural way.

Validation Methods

All 118,862 human-annotated captions were reviewed by human annotators.