Speech & Audio

Large-scale weakly-supervised pre-training for video action recognition

May 1, 2019

Abstract

Current fully-supervised video datasets consist of only a few hundred thousand videos and fewer than a thousand domain-specific labels. This hinders progress towards advanced video architectures. This paper presents an in-depth study of using large volumes of web videos to pre-train video models for action recognition. Our primary empirical finding is that pre-training at a very large scale (over 65 million videos), even on noisy social-media videos and hashtags, substantially improves the state of the art on three challenging public action recognition datasets. Further, we examine three questions in the construction of weakly-supervised video action datasets. First, given that actions involve interactions with objects, how should one construct a verb-object pre-training label space to benefit transfer learning the most? Second, frame-based models perform quite well on action recognition; is pre-training for good image features sufficient, or is pre-training for spatio-temporal features valuable for optimal transfer learning? Finally, actions are generally less well-localized in long videos than in short videos; since action labels are provided at the video level, how should one choose video clips for best performance, given a fixed budget in either the number of videos or the total minutes of video?

Related Publications

May 12, 2026

Human & Machine Intelligence

NeuralSet: A High-Performing Python Package for Neuro-AI

Corentin Bel, Linnea Evanson, Julien Gadonneix, Andrea Santos Revilla, Mingfang (Lucy) Zhang, Julie Bonnaire, Charlotte Caucheteux, Alexandre Défossez, Théo Desbordes, Pablo Diego-Simón, Shubh Khanna, Juliette Millet, Pierre Orhan, Saarang Panchavati, Antoine Ratouchniak, Alexis Thual, Hubert Jacob Banville, Jarod Levy, Jean Remi King, Josephine Raugel, Jérémy Rapin, Katelyn Begany, Marlene Careil, Simon Dahan, Sophia Houhamdi, Stéphane d'Ascoli, Teon Brooks, Yohann Benchetrit

March 17, 2026

Speech & Audio

Omnilingual SONAR: Cross-Lingual and Cross-Modal Sentence Embeddings Bridging Massively Multilingual Text and Speech

Omnilingual SONAR Team, Ioannis Tsiamas, Yen Meng, Vivek Iyer, Guillem Ramirez, Jaehyeong Jo, Alexandre Mourachko, Yu-An Chung, Artyom Kozhevnikov, Belen Alastruey, Christophe Ropers, David Dale, Holger Schwenk, João Maria Janeiro, Kevin Heffernan, Loic Barrault, Marta R. Costa-jussa, Paul-Ambroise Duquenne, Pere Lluís Huguet Cabot

November 10, 2025

Speech & Audio

Omnilingual ASR: Open-Source Multilingual Speech Recognition for 1600+ Languages

Omnilingual ASR team, Skyler Wang, Ife Adebara, Michael Auli, Kaushik Ram Sadagopan, Zheng-Xin Yong, Albert Ventayol-Boada, Alexandre Mourachko, Alexander Erben, Yu-An Chung, Arina Turkatenko, Artyom Kozhevnikov, Caley Drooff, Can Balioglu, Chierh Cheng, Christophe Ropers, Cynthia Gao, Gabriel Mejia Gonzalez, Gil Keren, Jean Maillard, Joe Chuang, Kehan Lyu, Kevin Chan, Mark Duppenthaler, Mary Williamson, Matthew Setzler, Paul-Ambroise Duquenne, Rashel Moritz, Safiyyah Saleem, Sagar Miglani, Shireen Yates, Vineel Pratap, Yen Meng

June 27, 2025

Human & Machine Intelligence

Conversational AI

Seamless Interaction: Dyadic Audiovisual Motion Modeling and Large-Scale Dataset

Morteza Behrooz, Ning Dong, Jeff Girard, Vasu Sharma, Jan Zikes, Akinniyi Akinyemi, Alex Shcherbyna, Alexander Richard, Alice Rakotoarison, Amia Oberai, Anastasis Stathopoulos, Anna Sun, Antony D'Avirro, Arina Turkatenko, Benjamin Peloquin, Bo Wan, Brandon Han, Carleigh Wood, Chao Wang, Chen Zhang, Christophe Ropers, Christopher Klaiber, Cynthia Gao, Dejan Kovachev, Denise Hernandez, Evonne Ng, Fabian Prada, Fabio Maria Carlucci, Guangyao Ma, Hang Li, Hirofumi Inaguma, Hongyu Gong, Jason Zheng, Jeff Wang, Jie Shen, Jiemin Zhang, Jing Ma, Joe Chuang, Jon Daly, Jovan Popovic, Joy Chen, Juan Pino, Julia Buffalini, Zhiyuan Yao, Junming Chen, Kam-Woh Ng, Kathryn Alvero, Louis-Philippe Morency, Lucas Mantovani, Mark Duppenthaler, Martin Gleize, Martin Ma, Mary Williamson, Michael Zollhoefer, Moneish Kumar, Omid Poursaeed, Paden Tomasello, Pavel Litvin, Pavlo Zhyzheria, Praveen Chowdary, Qingyao Jia, Raj Janardhan, Rongjie Huang, Safiyyah Saleem, Sagar Miglani, Sahir Gomez, Sen He, Shiyang Cheng, Somya Jain, Sreyas Mohan, Srivathsan Govindarajan, Tao Xiang, Tu Anh Nguyen, Tuan Tran, Vasu Agrawal, Wei Liu, Xinyue Zhang, Xutai Ma, Yilei Li, Yilin Yang, Yordan Hristov, Zhang Chen

October 10, 2016

Speech & Audio

Computer Vision

Polysemous Codes

Matthijs Douze, Hervé Jégou, Florent Perronnin

June 18, 2018

Speech & Audio

Computer Vision

Low-shot learning with large-scale diffusion

Matthijs Douze, Arthur Szlam, Bharath Hariharan, Hervé Jégou

July 10, 2018

NLP

Speech & Audio

Hierarchical Text Generation and Planning for Strategic Dialogue

Denis Yarats, Mike Lewis

September 08, 2017

NLP

Speech & Audio

Deal or No Deal? End-to-End Learning for Negotiation Dialogues

Mike Lewis, Denis Yarats, Yann Dauphin, Devi Parikh, Dhruv Batra
