APRIL 17, 2025

PLM Data

PLM Data is a comprehensive collection of synthetic and human-annotated datasets for detailed visual understanding, combining existing and newly collected data for tasks like OCR, captioning, and visual question answering.

Overview

PLM Data is a collection of datasets introduced in the paper "PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding". It consists of both synthetic and human-annotated samples.

Synthetic data is sourced from a wide range of image and video datasets covering basic capabilities of VLMs such as OCR, chart/document/diagram understanding, image/video captioning, and visual question answering. Human-annotated data provides rich, high-quality supervision for challenging image and video tasks. We combine existing human annotations on diverse image and video sources with our own newly collected annotations, geared towards fine-grained video understanding and spatio-temporally grounded reasoning.

PLM Video Human

Dataset Summary

PLM-Video-Human is a collection of human-annotated resources for training Vision Language Models, focused on detailed video understanding. Training tasks include fine-grained open-ended question answering (FGQA), Region-based Video Captioning (RCap), Region-based Dense Video Captioning (RDCap), and Region-based Temporal Localization (RTLoc).
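For quick inspection, the per-task subsets can presumably be loaded with the Hugging Face datasets library. The sketch below is not taken from the release: the repository id "facebook/PLM-Video-Human" and the configuration name "fgqa" are assumptions that should be checked against the published dataset card.

```python
# Minimal loading sketch (assumed repo id and config name; verify against the release).
from datasets import load_dataset

fgqa_train = load_dataset("facebook/PLM-Video-Human", "fgqa", split="train")
print(fgqa_train[0])  # inspect one training sample
```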

Fine-grained QA (FGQA) and Spatio-Temporal Captions (STC)

Dataset Structure

PLM-Video-Human includes the following training datasets:

Fine-Grained Question Answering (FGQA)

A video question answering dataset for fine-grained activity understanding. Contains human-annotated/verified answers to model-generated questions about video clips from open-access video datasets. The questions focus on "what" activities humans perform and "how" they perform these activities.
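As a rough illustration of what an FGQA training record contains, here is a hypothetical sample; the field names are illustrative assumptions, not the released schema.

```python
# Hypothetical FGQA training record (field names are illustrative, not the released schema).
fgqa_sample = {
    "video": "clip_0001.mp4",                                        # source video clip
    "question": "How does the person apply the paint to the wall?",  # model-generated question
    "answer": "With a roller, using long vertical strokes.",         # human-annotated/verified answer
    "source": "open-access video dataset",
}
```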

Region Video Captioning (RCap)

Each training sample is a detailed description of an event involving a subject of interest in the video. Given a region mask and a specified video segment (time interval), the target is a caption that accurately describes the event occurring within that interval.
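Concretely, an RCap record pairs a subject (region masklet) and a time interval with the target caption; the sketch below uses assumed field names for illustration only.

```python
# Hypothetical RCap training record (assumed field names).
rcap_sample = {
    "video": "clip_0042.mp4",
    "masklet_id": 3,          # which segmented subject the caption describes
    "start_time": 12.0,       # segment of interest, in seconds
    "end_time": 18.5,
    "caption": "The dog picks up the ball and carries it to the porch.",
}
```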

Region Temporal Localization (RTLoc)

Each training sample is a precise time interval within the video corresponding to a detailed description of an event involving a subject of interest in the video. Given a video, a region masklet and a textual description of the event, the targets are the start and end timestamps that correspond to the occurrence of the event.

Notably, this task is the inverse of RCap: instead of generating the caption, the model receives it as input and generates the corresponding time interval.
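Because RTLoc only swaps which of the caption and the time interval is input versus target, an RCap-style record (as sketched above) maps directly onto an RTLoc sample. The helper below is a hypothetical illustration of that inversion, not part of the released tooling.

```python
def rcap_to_rtloc(sample: dict) -> dict:
    """Illustrative only: RTLoc inverts RCap.

    RCap:  (video, masklet, interval) -> caption
    RTLoc: (video, masklet, caption)  -> (start, end)
    """
    return {
        "video": sample["video"],
        "masklet_id": sample["masklet_id"],
        "caption": sample["caption"],                          # now part of the input
        "target": (sample["start_time"], sample["end_time"]),  # now the prediction target
    }

rtloc_sample = rcap_to_rtloc({
    "video": "clip_0042.mp4", "masklet_id": 3,
    "start_time": 12.0, "end_time": 18.5,
    "caption": "The dog picks up the ball and carries it to the porch.",
})
```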

Region Dense Temporal Captioning (RDCap)

Each training sample is a detailed description of all events involving a specific subject of interest (e.g., a person, animal, or object) in a video. Given a video and a region masklet, the target is a sequence of (start, end, caption) triplets that cover the entire duration of the video, including periods when the subject is not visible.
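The sketch below shows one plausible shape for an RDCap target, including an explicit segment where the subject is out of view, together with a simple sanity check that the triplets tile the full video duration. The schema and the "not visible" convention are assumptions for illustration.

```python
# Hypothetical RDCap target: (start, end, caption) triplets covering the whole video.
rdcap_target = [
    (0.0,  4.0, "The cat is out of view."),                     # subject not visible
    (4.0,  9.5, "The cat walks onto the couch and curls up."),
    (9.5, 15.0, "The cat sleeps on the couch."),
]

def covers_full_duration(triplets, video_duration, tol=1e-3):
    """Check that consecutive segments are contiguous and span [0, video_duration]."""
    if abs(triplets[0][0]) > tol or abs(triplets[-1][1] - video_duration) > tol:
        return False
    return all(abs(prev[1] - nxt[0]) <= tol for prev, nxt in zip(triplets, triplets[1:]))

assert covers_full_duration(rdcap_target, video_duration=15.0)
```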

Dataset Statistics

Dataset statistics

Licensing Information

CC BY 4.0

PLM Video Bench

Dataset Summary

PLM-VideoBench is a collection of human-annotated resources for evaluating Vision Language Models, focused on detailed video understanding.

Fine-grained QA (FGQA) and Spatio-Temporal Captions (STC)

Supported Tasks

PLM-VideoBench includes eval data for the following tasks:

FGQA

In this task, a model must answer a multiple-choice question (MCQ) that probes fine-grained activity understanding. Given a question and multiple options that differ in a fine-grained detail (e.g., painting vertically vs. horizontally), the model must select the correct answer. To reduce bias, we follow prior work and report multi-binary accuracy (MBAcc). Specifically, each question is split into multiple binary-choice questions, where the correct answer is compared with one distractor at a time; a prediction is considered correct only when the correct answer is consistently selected across all binary comparisons.
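As a concrete reading of MBAcc, the sketch below scores a single MCQ by pairing the correct answer with each distractor and crediting the question only if the model wins every pairing; dataset-level MBAcc would then be the mean of these per-question scores. The `model_prefers_correct` callable is a hypothetical stand-in for whatever model call decides a binary choice.

```python
from typing import Callable, Sequence

def multi_binary_accuracy(
    question: str,
    correct: str,
    distractors: Sequence[str],
    model_prefers_correct: Callable[[str, str, str], bool],
) -> float:
    """Score one MCQ under MBAcc: 1.0 only if the correct answer beats
    every distractor in its (correct vs. distractor) binary comparison."""
    wins = [model_prefers_correct(question, correct, d) for d in distractors]
    return 1.0 if all(wins) else 0.0
```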

SGQA

In this task, a model must answer open-ended questions about activities and objects visible in an egocentric video stream recorded by a smart-glasses device. The questions are designed to simulate real-world scenarios where a user would ask for assistance from their smart glasses, such as "which of these two jackets would look better with this pair of shoes?" or "does this pasta look strained enough to you?". The source videos used to construct this benchmark component were independently collected and are not based on existing publicly available data. To evaluate performance we use LLM-judge accuracy.
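LLM-judge accuracy here can be read as the mean of per-question binary verdicts; in the minimal sketch below, `judge` is a hypothetical callable that compares a model answer against the reference and returns True or False, standing in for the actual LLM grader.

```python
def llm_judge_accuracy(predictions, references, judge) -> float:
    """Mean of binary judge verdicts; `judge(pred, ref)` stands in for an LLM-based grader."""
    verdicts = [judge(p, r) for p, r in zip(predictions, references)]
    return sum(verdicts) / len(verdicts)
```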

RCap

In this task, the model must generate a detailed description of an event involving a subject of interest in the video. Given a region mask and a specified time interval, the model is required to output a caption that accurately describes the event occurring within that interval. The test set contains 10,060 instances. We report LLM-judge accuracy to assess the quality of the generated captions.

RTLoc

In this task, the model must identify the precise time interval within the video when the specified event takes place for the given subject. Given a video, a region masklet and a textual description of the event, the model is required to output the start and end timestamps that correspond to the occurrence of the event.

Notably, this task is the inverse of RCap: instead of generating the caption, the model receives it as input and generates the corresponding time interval.

RDCap

In this task, a model must generate a detailed description of all events involving a specific subject of interest (e.g., a person, animal, or object) in a video. Given a video and a region masklet, the model must produce a sequence of (start, end, caption) tuples that cover the entire duration of the video, including periods when the subject is not visible. We report SODA score, which leverages an LLM judge to assess the quality of the generated captions.
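For intuition, a SODA-style score pairs predicted and reference (start, end, caption) tuples in temporal order and combines temporal overlap with a caption-quality score. The sketch below uses an LCS-style dynamic program with a placeholder `caption_score` (the role played by the LLM judge); it is only an approximation under those assumptions, not the benchmark's actual implementation.

```python
def tiou(a, b):
    """Temporal IoU between two (start, end) intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

def soda_like_f1(preds, refs, caption_score):
    """Order-preserving matching of (start, end, caption) tuples that maximizes
    the summed tIoU * caption_score, computed with an LCS-style dynamic program.
    `caption_score(pred_caption, ref_caption)` in [0, 1] stands in for the LLM judge."""
    n, m = len(preds), len(refs)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = tiou(preds[i - 1][:2], refs[j - 1][:2]) * caption_score(
                preds[i - 1][2], refs[j - 1][2]
            )
            dp[i][j] = max(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1] + pair)
    best = dp[n][m]
    precision = best / n if n else 0.0
    recall = best / m if m else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```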

Data Stats

Data stats for PLM-VideoBench

Licensing Information

CC BY 4.0

PLM Video Auto

Dataset Summary

Synthetic video captions and MCQs used in PLM; please refer to Section 3 of the paper for more details. The synthetic annotations cover captions for YT-1B and Ego4d, MCQAs for YT-1B, and QAs for Ego4d.

Data Stats

Data stats for ego4d_qa

Licensing Information

This data is an output from Llama 3.2, and subject to the Llama 3.2 license (https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/LICENSE). Use of the data to train, fine tune, or otherwise improve an AI model, which is distributed or made available, shall also include "Llama" at the beginning of any such AI model name.

PLM Image Auto

Dataset Summary

Synthetic image captions and QAs used in PLM; please refer to Section 3 of the paper for more details. The synthetic annotations cover SA1B, Openimages, Object365, ArxivQA, UCSF, and PDFAcc.
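As with the video data, individual subsets can presumably be loaded per source via the Hugging Face datasets library; the repository id "facebook/PLM-Image-Auto" and the configuration name "arxivqa" below are assumptions to verify against the published dataset card.

```python
from datasets import load_dataset

# Assumed repo id and config name; verify against the release before use.
arxivqa = load_dataset("facebook/PLM-Image-Auto", "arxivqa", split="train")
print(arxivqa[0])
```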

Data Stats

A sample from Misc captions

Licensing Information

This data is an output from Llama 3.2, and subject to the Llama 3.2 license (https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/LICENSE). Use of the data to train, fine tune, or otherwise improve an AI model, which is distributed or made available, shall also include "Llama" at the beginning of any such AI model name.