A Shortcut-aware Video-QA Benchmark for Physical Understanding via Minimal Video Pairs

June 11, 2025

Abstract

Existing benchmarks for assessing the spatio-temporal understanding and reasoning abilities of video-language models are susceptible to score inflation, because they admit shortcut solutions based on superficial visual or textual cues. This paper addresses the difficulty of accurately assessing model performance by introducing the Minimal Video Pairs (MVP) benchmark, a simple shortcut-aware video QA benchmark for assessing the physical understanding of video-language models. The benchmark comprises 55K high-quality multiple-choice video QA examples focusing on physical world understanding. Examples are curated from nine video data sources, spanning egocentric (first-person) and exocentric videos, robotic interaction data, and cognitive science intuitive physics benchmarks. To mitigate shortcut solutions that rely on superficial visual or textual cues and biases, each sample in MVP has a minimal-change pair: a visually similar video accompanied by an identical question but an opposing answer. To answer a question correctly, a model must provide correct answers for both examples in the minimal-change pair; as such, models that rely solely on visual or textual biases would achieve below-random performance. Human performance on MVP is 92.9%, while the best open-source state-of-the-art video-language model achieves 40.2%, against a random baseline of 25%.
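The paired scoring rule described in the abstract can be illustrated with a minimal sketch. This is not the authors' evaluation code: the `Example` and `MinimalPair` types, the `paired_accuracy` function, and the two-option answer labels "A" and "B" are all hypothetical names chosen for illustration. The sketch only encodes what the abstract states, namely that a pair counts as correct when both of its examples are answered correctly.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Example:
    """One video QA item (hypothetical schema, not the MVP data format)."""
    video_id: str   # illustrative identifier for the clip
    question: str   # identical for both members of a pair
    answer: str     # ground-truth option label, assumed here to be "A" or "B"


@dataclass(frozen=True)
class MinimalPair:
    """Two visually similar videos, same question, opposing answers."""
    first: Example
    second: Example


def paired_accuracy(
    pairs: Sequence[MinimalPair],
    predict: Callable[[Example], str],
) -> float:
    """Fraction of pairs for which BOTH examples are answered correctly."""
    if not pairs:
        raise ValueError("pairs must not be empty")
    correct = sum(
        predict(p.first) == p.first.answer
        and predict(p.second) == p.second.answer
        for p in pairs
    )
    return correct / len(pairs)


# Toy data, invented for illustration only.
demo_pairs = [
    MinimalPair(
        Example("clip_1a", "Does the cup tip over?", "A"),
        Example("clip_1b", "Does the cup tip over?", "B"),
    ),
    MinimalPair(
        Example("clip_2a", "Is the drawer opened?", "B"),
        Example("clip_2b", "Is the drawer opened?", "A"),
    ),
]

# A model that ignores the video and always picks the same option
# can never get both members of a pair right.
always_a = paired_accuracy(demo_pairs, lambda ex: "A")

# A model that answers from the question text alone fails for the same reason.
text_only = paired_accuracy(demo_pairs, lambda ex: "A" if "cup" in ex.question else "B")

# An oracle that returns the ground truth scores perfectly.
oracle = paired_accuracy(demo_pairs, lambda ex: ex.answer)
```

Under these assumptions, the constant and text-only predictors both score 0.0, since the two members of a pair share a question but have opposing answers, while the oracle scores 1.0. This is the sense in which purely bias-driven models fall below the random baseline.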

AUTHORS

Written by

Benno Krojer

Mojtaba Komeili

Candace Ross

Quentin Garrido

Koustuv Sinha

Nicolas Ballas

Mido Assran

Publisher

arXiv

Research Topics

Computer Vision

Core Machine Learning
