Today we are announcing Ego-Exo4D, a foundational dataset and benchmark suite to support research on video learning and multimodal perception. The result of a two-year effort by Meta’s FAIR (Fundamental Artificial Intelligence Research), Meta’s Project Aria, and 15 university partners, the centerpiece of Ego-Exo4D is its simultaneous capture of both first-person “egocentric” views, from a participant’s wearable camera, as well as multiple “exocentric” views, from cameras surrounding the participant. The two perspectives are complementary. While the egocentric perspective reveals what the participant sees and hears, the exocentric views reveal the surrounding scene and the context. Together, these two perspectives give AI models a new window into complex human skill.
Working together as a consortium, FAIR or university partners captured these perspectives with the help of more than 800 skilled participants in the United States, Japan, Colombia, Singapore, India, and Canada. In December, the consortium will open source the data (including more than 1,400 hours of video) and annotations for novel benchmark tasks. Additional details about the datasets can be found in our technical paper. Next year, we plan to host a first public benchmark challenge and release baseline models for ego-exo understanding. Each university partner followed their own formal review processes to establish the standards for collection, management, informed consent, and a license agreement prescribing proper use. Each member also followed the Project Aria Community Research Guidelines. With this release, we aim to provide the tools the broader research community needs to explore ego-exo video, multimodal activity recognition, and beyond.
How Ego-Exo4D works
Ego-Exo4D focuses on skilled human activities, such as playing sports, music, cooking, dancing, and bike repair. Advances in AI understanding of human skill in video could facilitate many applications. For example, in future augmented reality (AR) systems, a person wearing smart glasses could quickly pick up new skills with a virtual AI coach that guides them through a how-to video; in robot learning, a robot watching people in its environment could acquire new dexterous manipulation skills with less physical experience; in social networks, new communities could form based on how people share their expertise and complementary skills in video.
Such applications demand the ability to move fluidly between the exo and ego views. For example, imagine watching an expert repair a bike tire, juggle a soccer ball, or fold an origami swan—then being able to map their steps to your own actions. Cognitive science tells us that even from a very young age we can observe others’ behavior (exo) and translate it onto our own (ego).
Realizing this potential, however, is not possible using today's datasets and learning paradigms. Existing datasets comprised of both ego and exo views (i.e., ego-exo) are few, small in scale, lack synchronization across cameras, and/or are too staged or curated to be resilient to the diversity of the real world. As a result, the current literature for activity understanding primarily covers only the ego or exo view, leaving the ability to move fluidly between the first- and third-person perspectives out of reach.
Ego-Exo4D constitutes the largest public dataset of time-synchronized first- and third- person video. Building this dataset required the recruitment of specialists across varying domains, bringing diverse groups of people together to create a multifaceted AI dataset. All scenarios feature real-world experts, where the camera-wearer participant has specific credentials, training, or expertise in the skill being demonstrated. For example, among the Ego-Exo4D camera wearers are professional and college athletes; jazz, salsa, and Chinese folk dancers and instructors; competitive boulderers; professional chefs who work in industrial-scale kitchens; and bike technicians who service dozens of bikes per day.
Ego-Exo4D is not only multiview, it is also multimodal. Captured with Meta’s unique Aria glasses, all ego videos are accompanied by time-aligned seven channel audio, inertial measurement units (IMU), and two wide-angle grayscale cameras, among other sensors. All data sequences also provide eye gaze, head poses, and 3D point clouds of the environment through Project Aria’s state-of-the-art machine perception services. Additionally, Ego-Exo4D provides multiple new video-language resources:
All three language corpora are time-stamped against the video. With these novel video-language resources, AI models could learn about the subtle aspects of skilled human activities. To our knowledge, there is no prior video resource with such extensive and high quality multimodal data.
Alongside the data, we introduce benchmarks for foundational tasks for ego-exo video to spur the community's efforts. We propose four families of tasks:
We provide high quality annotations for training and testing each task—the result of more than 200,000 hours of annotator effort. To kickstart work in these new challenges, we also develop baseline models and report their results. We plan to host a first public benchmark challenge in 2024.
Collaboratively building on this research
The Ego4D consortium is a long-running collaboration between FAIR and more than a dozen universities around the world. Following the 2021 release of Ego4D, this team of expert faculty, graduate students, and industry researchers reconvened to launch the Ego-Exo4D effort. The consortium’s strengths are both its collective AI talent as well as its breadth in geography, which facilitates recording data in a wide variety of visual contexts. Overall, Ego-Exo4D includes video from six countries and seven U.S. states, offering a diverse resource for AI development. The consortium members and FAIR researchers collaborated throughout the project, from developing the initiative’s scope, to each collecting unique components of the dataset, to formulating the benchmark tasks. This project also marks the single largest coordinated deployment of the Aria glasses in the academic research community, with partners at 12 different sites using them.
In releasing this resource of unprecedented scale and variety, the consortium aims to supercharge the research community on core AI challenges in video learning. As this line of research advances, we envision a future where AI enables new ways for people to learn new skills in augmented reality and mixed reality (AR/MR), where how-to videos come to life in front of the user, and the system acts as a virtual coach to guide them through a new procedure and offer advice on how to improve. Similarly, we hope it will enable robots of the future that gain insight about complex dexterous manipulations by watching skilled human experts in action. Ego-Exo4D is a critical stepping stone to enable this future, and we can’t wait to see what the research community creates with it.
Kristen Grauman and Andrew Westbury contributed to this blog.
We’d also like to acknowledge the contributions of researchers at Carnegie Mellon University (CMU) and CMU-Africa, Georgia Institute of Technology, Indiana University, University of Minnesota, University of Pennsylvania, University of Catania, University of Bristol, University of Tokyo, International Institute of Information Technology, Hyderabad, King Abdullah University of Science and Technology, National University of Singapore, University of Los Andes, Simon Fraser University, the University of Texas at Austin, and the University of North Carolina at Chapel Hill.