SPEECH & AUDIO

COMPUTER VISION

Pushing the Frontier of Audiovisual Perception with Large-Scale Multimodal Correspondence Learning

December 16, 2025

Abstract

We introduce Perception Encoder Audiovisual (PE-AV), a new family of encoders for audio and video understanding trained with scaled contrastive learning. Built on Perception Encoder, PE-AV extends representations to audio and natively supports joint embeddings across audio–video, audio–text, and video–text modalities. PE-AV’s unified cross-modal embeddings enable novel tasks such as speech retrieval, and set a new state of the art across standard audio and video benchmarks. We unlock this by building a strong audiovisual data engine that synthesizes high-quality captions for O(100M) audio–video pairs, enabling large-scale supervision that is consistent across modalities. Our audio data spans speech, music, and general sound effects, avoiding the single-domain limitations common in prior work. We exploit ten pairwise contrastive objectives, showing that scaling the number of cross-modality and caption-type pairs strengthens alignment and improves zero-shot performance. We further develop PE-A-Frame by fine-tuning PE-AV with frame-level contrastive objectives, enabling fine-grained audio-frame-to-text alignment for tasks such as sound event detection. Code: https://github.com/facebookresearch/perception_models. Model: https://huggingface.co/collections/facebook/perception-encoder-audio-visual
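The pairwise contrastive objectives described above are not specified in detail here; as a rough illustration, each modality pair (e.g. audio–text) is typically trained with a symmetric InfoNCE loss over a batch of paired embeddings, where matched pairs sit on the diagonal of a similarity matrix. The sketch below is a generic CLIP-style formulation, not the paper's implementation; the function name, temperature value, and toy data are illustrative assumptions.

```python
import numpy as np

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings.

    a, b: (batch, dim) L2-normalized embeddings from two modalities
    (e.g. audio and text); row i of `a` is paired with row i of `b`.
    """
    logits = a @ b.T / temperature      # (batch, batch) similarity matrix
    labels = np.arange(len(a))          # positives lie on the diagonal

    def xent(l):
        # row-wise cross-entropy against the diagonal targets
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_p = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_p[labels, labels].mean()

    # average the two retrieval directions (a -> b and b -> a)
    return 0.5 * (xent(logits) + xent(logits.T))

# toy example: three perfectly aligned pairs give a near-zero loss
rng = np.random.default_rng(0)
a = rng.normal(size=(3, 8))
a /= np.linalg.norm(a, axis=1, keepdims=True)
b = a.copy()
loss_aligned = info_nce(a, b)
```

With ten such modality/caption pairings, the total objective would simply sum a loss of this form over each pair of embedding spaces; the abstract's claim is that scaling the number of these pairings improves alignment and zero-shot transfer.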


AUTHORS

Written by

Apoorv Vyas

Heng-Jui Chang

Cheng-Fu Yang

Bernie Huang

Luya Gao

Julius Richter

Sanyuan Chen

Matt Le

Piotr Dollar

Christoph Feichtenhofer

Ann Lee

Wei-Ning Hsu

Publisher

arXiv

