COMPUTER VISION

PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding

April 17, 2025

Abstract

Vision-language models are integral to computer vision research, yet many high-performing models remain closed-source, obscuring their data, design, and training recipes. The research community has responded by using distillation from black-box models to label training data. This achieves strong benchmark results, but without knowing the details of the teacher model and its data sources, scientific progress remains difficult to measure. In this paper, we study building a Perception Language Model (PLM) in a fully open and reproducible framework for transparent research in image and video understanding. We analyze standard training pipelines without distillation from proprietary models and explore large-scale synthetic data to identify critical data gaps, particularly in detailed video understanding. To bridge these gaps, we release 2.8M human-labeled instances of fine-grained video question-answer pairs and spatio-temporally grounded video captions. Additionally, we introduce PLM–VideoBench, a suite for evaluating challenging video understanding tasks that focuses on the ability to reason about the "what", "where", "when", and "how" of a video. We make our work fully reproducible by providing data, training recipes, code, and models.

AUTHORS

Written by

Yale Song

Hanoona Rasheed

Miguel Martin

Huiyu Wang

Salman Khan

Philipp Krähenbühl

Lorenzo Torresani

Kristen Grauman

Andrea Madotto

Andrew Westbury

Babak Damavandi

Po-Yao Huang

Christoph Feichtenhofer

Daniel Bolya

Effrosyni Mavroudi

Muhammad Maaz

Nikhila Ravi

Peize Sun

Piotr Dollar

Shane Moon

Shashank Jain

Shuming Hu

Suyog Jain

Tammy Stark

Tengyu Ma

Triantafyllos Afouras

Tushar Nagarajan

Jang Hyun Cho

Vivian Lee

Publisher

arXiv

Research Topics

Computer Vision
