March 3, 2022
Last year, Meta AI, together with a consortium of 13 universities and labs across nine countries, launched an ambitious long-term research project called Egocentric Live 4D Perception (Ego4D), which aims to train AI to understand and interact with the world like we do, from a first-person (or egocentric) perspective. As wearable devices like augmented and virtual reality headsets continue to improve and begin to power new experiences in the metaverse, AI will need to learn from entirely different data than what’s shown in typical videos filmed from handheld cameras. It must understand the world through human eyes in the context of continuous visual, motion, audio, and behavioral cues.
Ego4D features the world’s largest egocentric video data set, with more than 3,670 hours of footage, combined with five new research benchmarks, and is designed to push the frontier of first-person perception.
Today, together with the consortium, we’re announcing the public release of the data set and annotations, and the launch of a new large-scale public competition, which will run from March through October this year. We’ll declare the first round of winners at the Joint International 1st Ego4D and 10th EPIC Workshop at CVPR 2022. This set of large-scale challenges entails 16 egocentric tasks, including querying episodic memories, analyzing hand-object manipulations, audio-visual conversation, social interactions, and forecasting the camera wearer’s future activity. The Ego4D challenge begins today with a subset of six competition tracks; the remaining 10 tracks will launch in the weeks ahead.
As a complement to these efforts, Meta AI is also releasing EgoObjects, the largest object-centric data set, containing more than 110 hours of egocentric video and focusing on object detection tasks. It includes 40,000 videos and 1.2M object annotations spanning up to 600 object categories. This is part of the CLVision workshop at CVPR 2022, which supports research on continual learning for object detection at both the category and instance levels.
We hope these resources and competitions will inspire AI researchers to improve our baselines, push the state of the art, and join our efforts in advancing egocentric perception. By collaborating and openly sharing our work, we hope to accelerate progress in this important research area.
Ego4D data set
Ego4D challenge participants will use Ego4D’s unique annotated data set of more than 3,670 hours of video data, capturing the daily-life scenarios of more than 900 unique individuals from nine different countries around the world. Ego4D footage is unscripted and “in the wild.” To build the data set, each university team was responsible for complying with its own institutional research policy. The process involved developing a study protocol compliant with standards from institutional research ethics committees and/or review boards, including a process to obtain informed consent and/or video release from participants.
It showcases daily activities encompassing life around the world. Portions of this data are accompanied by audio, 3D environment meshes, eye gaze, stereo, and synchronized video from multiple head-mounted cameras. Ego4D is richly annotated, offering temporal, spatial, and semantic labels, natural language queries, and speech transcription. It contains annotations built on textual summaries of the entire data set, with annotators describing actions appearing in the data set roughly every five seconds, resulting in more than 3.85M unique sentences.
There will be 16 independent competition tracks, all leveraging the Ego4D data sets and benchmark suite. For each track, challenge entrants will use Ego4D to train models to perform our benchmark tasks and compete to achieve the highest accuracy scores. After a period of validation, the highest scoring teams will receive cash prizes.
Episodic memory challenge: What did I do?
Developing superhuman memory, indexing egocentric videos, and answering queries about past objects, activities, or locations.
Challenge participants will build systems responsive to three query types:
Visual queries with 2D and 3D localization: Given an egocentric video clip and an image crop depicting the query object, the goal is to return the last time the object was seen in the input video, in terms of the tracked bounding box (2D + temporal localization) and the 3D displacement vector from the camera to the object in the environment.
Natural language queries: Given a video clip and a query expressed in natural language, the goal is to localize the temporal window within all the video history where the answer to the question is evident. So, you could ask an AI system: “What did I pick up before leaving the party?”
Moments queries: Given an egocentric video and an activity name (i.e., a “moment”), the goal is to localize all instances of that activity in the past video, like “When are all the times I drank coffee today?”
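To make the episodic-memory output format concrete, here is a minimal sketch of the temporal-localization part of a visual query: given per-frame detections of the query object, return the last contiguous window in which it was seen. The names (`TemporalWindow`, `last_occurrence`) and the per-frame boolean input are illustrative assumptions, not the Ego4D API or baseline.

```python
# Hypothetical sketch of "return the last time the object was seen."
# TemporalWindow and last_occurrence are made-up names, not Ego4D code.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class TemporalWindow:
    start_frame: int  # inclusive
    end_frame: int    # inclusive

def last_occurrence(presence: List[bool]) -> Optional[TemporalWindow]:
    """Return the last contiguous run of frames in which the query
    object was detected, scanning the clip from the end backward."""
    end = None
    for i in range(len(presence) - 1, -1, -1):
        if presence[i] and end is None:
            end = i  # last frame of the final occurrence
        elif not presence[i] and end is not None:
            return TemporalWindow(start_frame=i + 1, end_frame=end)
    return TemporalWindow(0, end) if end is not None else None

# Toy per-frame detections: object visible in frames 2-3 and again in 6-7.
presence = [False, False, True, True, False, False, True, True, False]
print(last_occurrence(presence))  # TemporalWindow(start_frame=6, end_frame=7)
```

A full solution would also track a 2D bounding box per frame and estimate the 3D displacement from the camera to the object; this sketch covers only the temporal side.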
Hands and objects challenge: What am I doing now?
Understanding the present by detecting and classifying moments when camera wearers change the state of an object they are manipulating.
Challenge participants will address two tasks:
Temporal localization and classification: Given an egocentric video clip, the goal is to localize temporally the key frames that indicate an object state change (e.g., chopping a tomato or assembling pieces of wood), and identify what kind of state change it is.
State change object detection: Given an egocentric video clip, the goal is to identify the objects whose states are changing and outline them with bounding boxes.
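As a concrete illustration of the temporal localization sub-task, the sketch below picks the keyframe where an object state change is most likely, given per-frame scores from some upstream video model. The function name and the threshold are assumptions for illustration, not the official baseline.

```python
# Illustrative sketch: localize the state-change keyframe from per-frame
# scores (e.g., produced by a hypothetical video model). Not Ego4D code.
from typing import List, Tuple

def localize_state_change(scores: List[float],
                          threshold: float = 0.5) -> Tuple[int, bool]:
    """Return (keyframe_index, detected): the frame with the highest
    state-change score, and whether it clears the detection threshold."""
    keyframe = max(range(len(scores)), key=lambda i: scores[i])
    return keyframe, scores[keyframe] >= threshold

# Toy scores for an 8-frame clip of someone chopping a tomato:
scores = [0.05, 0.1, 0.2, 0.9, 0.4, 0.1, 0.05, 0.02]
keyframe, detected = localize_state_change(scores)
print(keyframe, detected)  # 3 True
```

A complete entry would additionally classify the kind of state change and, for the detection task, output bounding boxes around the objects being changed.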
Audio-visual diarization and social challenges: Which person said what, when? Which person is talking to me? Which person is looking at me?
Understanding spoken language and social interactions from the egocentric perspective using the simultaneous capture of video and audio. The audio-visual diarization challenge enables understanding of the discourse of conversations (e.g., “What was the main topic during class?”), while the audio-visual social challenge enables embodied approaches to understanding social behaviors and settings. We begin this process with efforts to identify communicative acts and their content, as well as attention directed toward the camera wearer.
Challenge participants will address these problems through the following tasks:
Audio-visual localization: Given an egocentric video clip, the goal is to identify which person spoke and when they spoke.
Speech transcription: Given an egocentric video clip, the goal is to automatically transcribe the speech of each person.
Talking to me: Given an egocentric video clip, the goal is to identify whether someone in the scene is talking to the camera wearer.
Looking at me: Given an egocentric video clip, the goal is to identify whether someone in the scene is looking at the camera wearer.
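The diarization output ("who spoke when") can be pictured as collapsing a per-frame active-speaker track into contiguous segments. The sketch below does exactly that with toy labels; the function name, label format, and timings are illustrative assumptions, not the Ego4D annotation schema.

```python
# Minimal sketch of a diarization result: per-frame speaker labels
# collapsed into (speaker, start_frame, end_frame) segments. Toy data.
from typing import List, Optional, Tuple

def to_segments(labels: List[Optional[str]]) -> List[Tuple[str, int, int]]:
    """Collapse a per-frame speaker track (None = silence) into
    contiguous speech segments."""
    segments: List[Tuple[str, int, int]] = []
    for i, who in enumerate(labels):
        if (who is not None and segments
                and segments[-1][0] == who and segments[-1][2] == i - 1):
            spk, start, _ = segments[-1]
            segments[-1] = (spk, start, i)  # extend the current segment
        elif who is not None:
            segments.append((who, i, i))    # open a new segment
    return segments

labels = ["A", "A", None, "B", "B", "B", None, "A"]
print(to_segments(labels))  # [('A', 0, 1), ('B', 3, 5), ('A', 7, 7)]
```

The speech transcription task would then attach text to each such segment, and the social tasks ("talking to me" / "looking at me") would add per-segment binary labels relative to the camera wearer.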
As organizers of the audio-visual diarization and social challenges, we’re aware of the privacy implications of this long-term research agenda. We feel that Ego4D’s efforts to fundamentally advance audio and social understanding are necessary precursors to developing future safeguards.
To support these efforts, we are integrating a privacy-oriented challenge into our 2022 audio-visual diarization and social challenges. We’re calling on the global research community to review audio-visual recordings of conversations and interactions, along with their annotations, and propose novel methods to use these resources for privacy-preserving research. Challenge participants can submit a research prospectus, and those considered most practical or impactful will receive prize funding to support their work. Areas can include, but are not limited to, masking personally identifiable information (PII) in audio-visual data, or technologies for understanding your audience (i.e., distinguishing the people you are conversing with from bystanders). We hope crowdsourcing research concepts and ideas will generate critical momentum as Ego4D launches its own efforts in these areas. The privacy-oriented challenge will formally launch later this month.
Forecasting challenge: What will I do next?
Anticipating the camera wearer’s next movements and interactions. This challenge will be important for building more useful applications for AI-powered assistants, like teaching people how to play the drums or warning that someone has already added salt to a recipe as you reach for the salt shaker. Our challenge focuses on:
Locomotion forecasting: Given a video frame and the past trajectory, the goal is to predict the future ego positions of the camera wearer (in the form of a 3D trajectory).
Hand forecasting: Given a video clip, the goal is to predict the next active objects, the next action, and the time to contact.
Long-term activity prediction: Given a video clip, the goal is to predict what sequence of activities will happen in the future. For example, after kneading dough, what will the baker do next?
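A simple way to make the locomotion forecasting task concrete is a constant-velocity baseline: extrapolate the camera wearer's future 3D positions from the past trajectory. This is a naive illustrative baseline under that assumption, not the challenge's reference method.

```python
# Constant-velocity baseline for locomotion forecasting (illustrative,
# not the official Ego4D baseline). Positions are 3D camera-wearer
# coordinates; velocity is taken from the last two observations.
from typing import List, Tuple

Point3D = Tuple[float, float, float]

def forecast(past: List[Point3D], horizon: int) -> List[Point3D]:
    """Extrapolate `horizon` future positions assuming the last observed
    velocity stays constant."""
    (x0, y0, z0), (x1, y1, z1) = past[-2], past[-1]
    vx, vy, vz = x1 - x0, y1 - y0, z1 - z0
    return [(x1 + vx * t, y1 + vy * t, z1 + vz * t)
            for t in range(1, horizon + 1)]

# Wearer walking at 1 m/step along x; predict the next two positions.
past = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
print(forecast(past, 2))  # [(3.0, 0.0, 0.0), (4.0, 0.0, 0.0)]
```

Real entries would condition on the video frame as well as the past trajectory; this sketch only shows the expected input/output shape of the task.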
Please review the submission guidelines before entering, and note that participants must submit their entries via EvalAI. The winning team from each track will be invited to nominate a team member to share their work at a CVPR 2022 event, where we will also share the challenge leaderboards.
Partners in the 2022 Ego4D Challenge:
Carnegie Mellon University (Pittsburgh, Kigali)
Georgia Institute of Technology
Indiana University Bloomington
Massachusetts Institute of Technology
University of Minnesota
University of Pennsylvania
University of Catania
University of Bristol
University of Tokyo
International Institute of Information Technology, Hyderabad
King Abdullah University of Science and Technology
National University of Singapore
University of Los Andes
University of California, Berkeley
Meta AI Research
To further push the limits of egocentric perception, we’ve also created the first large-scale data set focused on object detection for egocentric video, featuring diverse viewpoints as well as different scales, backgrounds, and lighting conditions. While most existing comparable data sets are either not object-centric or not large-scale, our initial release will cover over 40,000 videos (110+ hours) across 200 main object categories in over 25 countries. Besides the main objects, the videos also capture various surrounding objects in the background; including these, the total number of object categories rises to 600.
Data collection is conducted with a wide range of egocentric recording devices (Ray-Ban Stories, Snap Spectacles, Aria glasses, and mobile devices) in realistic household scenarios. EgoObjects also features an array of rich data annotations, like bounding boxes, category labels, and instance IDs, as well as rich meta information, like background descriptions, lighting conditions, and location.
The EgoObjects data set will be used for the continual learning challenge at the CLVision workshop at CVPR 2022, which comprises three tracks:
Continual instance-level object classification. In this task, solutions handle a stream of training experiences containing images of common household or workplace objects. The solution can access the ground-truth label of each training image in order to incrementally train its internal model (fully supervised). Images depict a single object, and the expected prediction is a classification label. The solution must return predictions at the instance level; that is, it must distinguish between individual objects belonging to common categories.
Continual category-level object detection. In this task, incremental experiences will carry short videos of common household or workplace objects. Objects will be depicted in common household and workplace environments, with each image depicting more than one object. The goal is to predict the bounding box and label of the depicted objects. This can be a very good starting point when first approaching continual detection tasks.
Continual instance-level object detection. In this task, incremental experiences will carry short videos of common household or workplace objects. Differently from its category-level counterpart, the goal is to predict the object labels at the instance level. Each video will feature a single “reference” object (possibly surrounded by other unrelated objects). The goal is to predict the position and instance label of that reference object. This task is harder than the category-level counterpart.
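The incremental protocol behind these tracks can be sketched with a toy example: a nearest-class-mean classifier updated one experience at a time, keeping instance IDs apart even when they share a category. The class name, the hand-made feature vectors standing in for image embeddings, and the instance IDs are all assumptions for illustration; this is not a competitive solution or CLVision starter code.

```python
# Toy continual-learning loop for instance-level classification:
# a nearest-class-mean classifier absorbing a stream of experiences.
# Feature vectors are hand-made stand-ins for image embeddings.
from typing import Dict, List, Tuple

class NearestMeanClassifier:
    def __init__(self) -> None:
        self.sums: Dict[str, List[float]] = {}   # running feature sums
        self.counts: Dict[str, int] = {}         # samples seen per instance

    def update(self, experience: List[Tuple[List[float], str]]) -> None:
        """Incrementally absorb one training experience (fully supervised)."""
        for features, label in experience:
            if label not in self.sums:
                self.sums[label] = [0.0] * len(features)
                self.counts[label] = 0
            self.sums[label] = [s + f
                                for s, f in zip(self.sums[label], features)]
            self.counts[label] += 1

    def predict(self, features: List[float]) -> str:
        """Return the instance whose mean feature vector is closest."""
        def dist(label: str) -> float:
            mean = [s / self.counts[label] for s in self.sums[label]]
            return sum((f - m) ** 2 for f, m in zip(features, mean))
        return min(self.sums, key=dist)

# Two experiences arrive over time; "mug_1" and "mug_2" share a category
# but must be kept apart at the instance level.
clf = NearestMeanClassifier()
clf.update([([1.0, 0.0], "mug_1"), ([0.9, 0.1], "mug_1")])
clf.update([([0.0, 1.0], "mug_2")])
print(clf.predict([0.95, 0.05]))  # mug_1
```

The detection tracks follow the same incremental-experience protocol but predict bounding boxes alongside the category or instance label.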
These tracks are designed to advance object understanding from the egocentric perspective, a fundamental building block for AR applications. Examples include, but are not limited to, object search for both visible and hidden items, object-anchored contextual reminders, automatic grocery-list building, and object-anchored immersive virtual content.
Beyond continual learning for 2D object detection, EgoObjects also supports research on 3D object detection and pose estimation, since the data set includes 3D annotations. In this task, the solution has access to 2D RGB images and, as ground truth, oriented 3D bounding boxes of objects. The goal is to predict the 3D bounding boxes, with correct pose, from a single 2D image.
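To make the 3D annotation format concrete, the sketch below represents a 3D box as its eight corners in camera coordinates and projects them into the 2D image with a pinhole model. The intrinsics and box dimensions are made-up values for illustration, not EgoObjects metadata.

```python
# Illustrative pinhole projection of a 3D box's corners into the image.
# Intrinsics (fx, fy, cx, cy) are made-up values, not EgoObjects data.
from typing import List, Tuple

def project(points: List[Tuple[float, float, float]],
            fx: float, fy: float, cx: float, cy: float
            ) -> List[Tuple[float, float]]:
    """Project 3D camera-frame points (z > 0) to pixel coordinates."""
    return [(fx * x / z + cx, fy * y / z + cy) for x, y, z in points]

# Axis-aligned 1 m cube centered 4 m in front of the camera.
corners = [(x, y, z)
           for x in (-0.5, 0.5) for y in (-0.5, 0.5) for z in (3.5, 4.5)]
pixels = project(corners, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
print(pixels[0])  # corner (-0.5, -0.5, 3.5) lands near (248.6, 168.6)
```

The benchmark task runs this relationship in reverse: recover the oriented 3D box (and hence the pose) from the 2D image alone, which is what makes it hard.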