PLM Data is a comprehensive collection of synthetic and human-annotated datasets for detailed visual understanding, combining existing and newly collected data for tasks like OCR, captioning, and visual question answering.
PLM Data is a collection of datasets introduced in the paper "PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding". It consists of both synthetic and human-annotated samples.
Synthetic data is sourced from a wide range of image and video datasets covering basic capabilities of VLMs such as OCR, chart/document/diagram understanding, image/video captioning, and visual question answering. Human-annotated data provides rich, high-quality supervision for challenging image and video tasks. We combine existing human annotations from diverse image and video sources with our own newly collected human annotations, geared towards fine-grained video understanding and spatio-temporally grounded reasoning.
PLM-Video-Human is a collection of human-annotated resources for training Vision Language Models, focused on detailed video understanding. Training tasks include: fine-grained open-ended question answering (FGQA), Region-based Video Captioning (RCap), Region-based Dense Video Captioning (RDCap), and Region-based Temporal Localization (RTLoc).
PLM-Video-Human includes the following training datasets:
A video question answering dataset for fine-grained activity understanding. Contains human-annotated/verified answers to model-generated questions about video clips from open-access video datasets. The questions focus on "what" activities humans perform and "how" they perform these activities.
Each training sample is a detailed description of an event involving a subject of interest in the video. Given a region mask and a specified video segment (time interval), the target is a caption that accurately describes the event occurring within that interval.
Each training sample is a precise time interval within the video corresponding to a detailed description of an event involving a subject of interest in the video. Given a video, a region masklet and a textual description of the event, the targets are the start and end timestamps that correspond to the occurrence of the event.
Notably, this task is the inverse of RCap --- instead of generating the caption, the model receives it as input and generates the corresponding time interval.
Each training sample is a detailed description of all events involving a specific subject of interest (e.g., a person, animal, or object) in a video. Given a video and a region masklet, the target is a sequence of (start, end, caption) triplets that cover the entire duration of the video, including periods when the subject is not visible.
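For illustration, the sketch below shows what one training record for each of the four tasks might look like. The field names and values are hypothetical placeholders for illustration only; consult the released files for the actual schema.

```python
# Hypothetical record layouts for the PLM-Video-Human training tasks.
# Field names and values are illustrative assumptions, not the released schema.

fgqa_record = {
    "video": "clip_0001.mp4",
    "question": "How does the person fold the dough?",      # model-generated, human-verified
    "answer": "They fold it in half twice, pressing down with both palms.",
}

rcap_record = {
    "video": "clip_0002.mp4",
    "masklet_id": 3,                      # region mask identifying the subject of interest
    "start_time": 4.0, "end_time": 9.5,   # segment to describe, in seconds
    "caption": "The dog picks up the ball and carries it to the porch.",
}

rtloc_record = {                          # inverse of RCap: caption is the input, interval is the target
    "video": "clip_0002.mp4",
    "masklet_id": 3,
    "caption": "The dog picks up the ball and carries it to the porch.",
    "start_time": 4.0, "end_time": 9.5,   # target timestamps
}

rdcap_record = {                          # dense captions covering the entire video duration
    "video": "clip_0003.mp4",
    "masklet_id": 1,
    "dense_captions": [
        {"start_time": 0.0, "end_time": 3.2,  "caption": "The person is not visible."},
        {"start_time": 3.2, "end_time": 11.0, "caption": "The person enters the frame and waters the plants."},
    ],
}
```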
CC BY 4.0
PLM-VideoBench is a collection of human-annotated resources for evaluating Vision Language Models, focused on detailed video understanding.
PLM-VideoBench includes eval data for the following tasks:
In this task, a model must answer a multiple-choice question (MCQ) that probes fine-grained activity understanding. Given a question and multiple options that differ in a fine-grained detail (e.g., painting vertically vs. horizontally), the model must select the correct answer. To reduce bias, we follow prior work and report multi-binary accuracy (MBAcc). Specifically, each question is split into multiple binary-choice questions, where the correct answer is compared with one distractor at a time; a prediction is considered correct only when the correct answer is consistently selected across all binary comparisons.
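The multi-binary accuracy above can be sketched in a few lines: each multiple-choice question is decomposed into one-vs-one comparisons between the ground-truth answer and each distractor, and the question counts as correct only if the model picks the ground truth every time. The `ask_binary` callable below is a placeholder for the actual model inference routine.

```python
from typing import Callable, Sequence

def multi_binary_accuracy(
    questions: Sequence[dict],
    ask_binary: Callable[[str, str, str], str],
) -> float:
    """Sketch of MBAcc. `ask_binary(question, option_a, option_b)` is a
    placeholder that returns whichever of the two option strings the model
    chooses; a question is correct only if the ground-truth answer wins
    against every distractor."""
    correct = 0
    for q in questions:  # q: {"question": str, "answer": str, "distractors": [str, ...]}
        wins_all = all(
            ask_binary(q["question"], q["answer"], d) == q["answer"]
            for d in q["distractors"]
        )
        correct += int(wins_all)
    return correct / max(len(questions), 1)
```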
In this task, a model must answer open-ended questions about activities and objects visible in an egocentric video stream recorded by a smart-glasses device. The questions are designed to simulate real-world scenarios where a user would ask for assistance from their smart glasses, such as "which of these two jackets would look better with this pair of shoes?" or "does this pasta look strained enough to you?". The source videos used to construct this benchmark component were independently collected and are not based on existing publicly available data. To evaluate performance we use LLM-judge accuracy.
In this task, the model must generate a detailed description of an event involving a subject of interest in the video. Given a region mask and a specified time interval, the model is required to output a caption that accurately describes the event occurring within that interval. The test set contains 10060 instances. We report LLM-judge accuracy to assess the quality of the generated captions.
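LLM-judge accuracy (used here and for SGQA above) can be approximated as the fraction of predictions a judge model marks as matching the ground truth. The prompt wording and the `judge` callable below are illustrative placeholders; the exact judging protocol is the one described in the paper.

```python
from typing import Callable, Sequence

JUDGE_PROMPT = (
    "Question/context: {context}\n"
    "Ground-truth answer: {reference}\n"
    "Model answer: {prediction}\n"
    "Does the model answer convey the same information as the ground truth? "
    "Reply with exactly 'yes' or 'no'."
)

def llm_judge_accuracy(samples: Sequence[dict], judge: Callable[[str], str]) -> float:
    """Sketch only: average of binary judge verdicts over the evaluation set."""
    verdicts = []
    for s in samples:  # s: {"context": str, "reference": str, "prediction": str}
        reply = judge(JUDGE_PROMPT.format(**s))
        verdicts.append(reply.strip().lower().startswith("yes"))
    return sum(verdicts) / max(len(verdicts), 1)
```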
In this task, the model must identify the precise time interval within the video when the specified event takes place for the given subject. Given a video, a region masklet and a textual description of the event, the model is required to output the start and end timestamps that correspond to the occurrence of the event.
Notably, this task is the inverse of RCap --- instead of generating the caption, the model receives it as input and generates the corresponding time interval.
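A standard way to compare a predicted (start, end) interval with the ground truth is temporal intersection-over-union; the snippet below only illustrates that quantity and is not necessarily the exact RTLoc metric reported in the paper.

```python
def temporal_iou(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    """Illustrative temporal IoU between a predicted and a ground-truth
    (start, end) interval, in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# Example: temporal_iou((4.2, 9.0), (4.0, 9.5)) = 4.8 / 5.5 ≈ 0.87
```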
In this task, a model must generate a detailed description of all events involving a specific subject of interest (e.g., a person, animal, or object) in a video. Given a video and a region masklet, the model must produce a sequence of (start, end, caption) tuples that cover the entire duration of the video, including periods when the subject is not visible. We report SODA score, which leverages an LLM judge to assess the quality of the generated captions.
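As a rough illustration of SODA-style scoring: predicted and ground-truth (start, end, caption) segments are aligned one-to-one in temporal order, each matched pair is scored by temporal IoU weighted by a caption-quality score (supplied here by the LLM judge), and an F1 over the two sequence lengths is reported. The sketch below is a simplified stand-in, not the paper's exact implementation, and `caption_score` is a placeholder for the judge.

```python
from typing import Callable, Sequence

Segment = tuple[float, float, str]  # (start, end, caption)

def soda_style_f1(
    preds: Sequence[Segment],
    refs: Sequence[Segment],
    caption_score: Callable[[str, str], float],  # placeholder, e.g. an LLM-judge score in [0, 1]
) -> float:
    """Simplified SODA-style score: order-preserving one-to-one alignment via
    dynamic programming, pair score = temporal IoU * caption score, F1 over
    the numbers of predicted and reference segments. Illustrative only."""
    def tiou(a: Segment, b: Segment) -> float:
        inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
        union = (a[1] - a[0]) + (b[1] - b[0]) - inter
        return inter / union if union > 0 else 0.0

    n, m = len(preds), len(refs)
    if n == 0 or m == 0:
        return 0.0
    # dp[i][j] = best total pair score using the first i predictions and j references
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = tiou(preds[i - 1], refs[j - 1]) * caption_score(preds[i - 1][2], refs[j - 1][2])
            dp[i][j] = max(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1] + pair)
    precision, recall = dp[n][m] / n, dp[n][m] / m
    return 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0
```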
CC BY 4.0
Synthetic video captions and MCQs used in PLM; please refer to Section 3 of the paper for more details. The synthetic annotations cover: YT-1B and Ego4d with captions, YT-1B with MCQAs, and Ego4d with QAs.
This data is an output from Llama 3.2, and subject to the Llama 3.2 license (https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/LICENSE). Use of the data to train, fine tune, or otherwise improve an AI model, which is distributed or made available, shall also include "Llama" at the beginning of any such AI model name.
Synthetic image captions and QAs used in PLM; please refer to Section 3 of the paper for more details. The synthetic annotations cover: SA1B, Openimages, Object365, ArxivQA, UCSF, and PDFAcc.
This data is an output from Llama 3.2, and subject to the Llama 3.2 license (https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/LICENSE). Use of the data to train, fine tune, or otherwise improve an AI model, which is distributed or made available, shall also include "Llama" at the beginning of any such AI model name.