Ego4D is a collaborative project that aims to advance the fundamental AI research needed for multimodal machine perception in first-person video understanding.
Today’s perception systems excel at detecting and labeling objects in individual Internet photos or videos. In contrast, first-person or “egocentric” perception requires understanding ongoing sensory data (images, video, audio, and motion) as it streams to a person’s wearable, head-mounted device. It demands the integration of this multimodal data with 3D understanding of physical environments, social contexts, and human-object interactions. Furthermore, whereas users today actively take their photos—framing them intentionally to convey a message or capture a memory—images collected by wearable cameras lack this curation, presenting a much greater challenge for algorithms trying to understand them. Motivated by these contrasts, Facebook AI brought together 13 universities and academic research organizations from around the world to embark on an ambitious, long-term project, called “Egocentric Live 4D Perception” (Ego4D). The project is designed to spur egocentric research outside and inside of the company.
In collaboration with these universities and Facebook Reality Labs Research (FRL), Facebook AI is releasing five collectively developed AI benchmarks for academics, researchers, and developers to use in advancing the fundamental AI technology needed to build more useful AI assistants and home robots of the future.
The five benchmarks are:
Episodic Memory: Given an egocentric video and a query, the Episodic Memory task requires localizing where the answer can be seen within the user’s past video (a schematic input/output sketch follows this list)
Hands and Objects: The Hands and Objects task captures how the camera wearer changes the state of an object by using or manipulating it
Audio-Visual Diarization: The Audio-Visual Diarization benchmark is composed of four tasks: 1) localizing and tracking speakers in the visual field of view, 2) active speaker detection, 3) diarization of each speaker’s activity, and 4) transcription of speech content
Social Interactions: The Social benchmark focuses on multimodal understanding of conversational interactions
Forecasting: The Forecasting benchmark includes four tasks: 1) locomotion prediction, 2) hand movement prediction, 3) short-term object interaction anticipation, and 4) long-term action anticipation
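To make these task definitions concrete, below is a minimal, hypothetical Python sketch of the input and output of an Episodic Memory-style query: a natural-language question about a past egocentric video is answered with one or more candidate temporal windows. The class names, fields, and example values are illustrative assumptions for this sketch, not the official Ego4D benchmark API.

```python
# Hypothetical sketch of an Episodic Memory-style query interface.
# The names and values below are assumptions, not the Ego4D benchmark API.
from dataclasses import dataclass
from typing import List


@dataclass
class EpisodicQuery:
    video_id: str   # identifier of the camera wearer's past egocentric video
    question: str   # natural-language query, e.g. "Where did I leave my keys?"


@dataclass
class TemporalWindow:
    start_sec: float  # start of the predicted answer segment, in seconds
    end_sec: float    # end of the predicted answer segment, in seconds
    score: float      # model confidence for this candidate window


def localize(query: EpisodicQuery) -> List[TemporalWindow]:
    """Stub localizer: a real model would ground the query in the video's
    frames and return ranked candidate windows; the values here are made up."""
    return [TemporalWindow(start_sec=412.0, end_sec=418.5, score=0.91)]


if __name__ == "__main__":
    result = localize(EpisodicQuery(video_id="ego_clip_001",
                                    question="Where did I leave my keys?"))
    for window in result:
        print(f"Answer visible from {window.start_sec}s to {window.end_sec}s "
              f"(confidence {window.score})")
```

The point of the sketch is only the shape of the problem: the model must ground a query in previously recorded first-person video and return when, in that past footage, the answer is visible.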
Progress in this field requires large volumes of first-person data with the scale, diversity, and complexity necessary to be useful in the real world. As part of Ego4D, our university partners collected thousands of hours of unscripted, first-person video from more than 700 research participants capturing hundreds of daily-life scenarios around the world. The participants, who span nine countries and a range of ages, genders, and backgrounds, recorded with off-the-shelf, head-mounted camera devices. This data will be available to the public research community later this year.
As a supplement to this work, researchers from Facebook Reality Labs used Vuzix Blade smart glasses to collect an additional 400 hours of fully consented, first-person video data in staged environments in our research labs.
The 13 university partners are:
Carnegie Mellon University (CMU) and CMU-Africa
Georgia Institute of Technology
Indiana University
Massachusetts Institute of Technology
University of Minnesota
University of Pennsylvania
University of Catania
University of Bristol
University of Tokyo
International Institute of Information Technology, Hyderabad
King Abdullah University of Science and Technology
National University of Singapore
University of Los Andes