A benchmark and dataset for systematic investigation of vision-language models on compositional, causal (e.g., effect of actions), and temporal (e.g., action ordering) reasoning in egocentric settings.
To enable progress towards egocentric agents capable of reasoning about everyday tasks specified in natural language, this benchmark was introduced in our ICCV 2023 paper “EgoTV: Egocentric Task Verification from Natural Language Task Descriptions.” The objective in EgoTV is to verify an agent’s execution of tasks from its egocentric videos based on the natural language description of these tasks.
EgoTV contains pairs of synthetic egocentric videos of an agent's task execution and the associated natural language descriptions of multi-step tasks; these tasks involve multiple sub-task decompositions, state changes, object interactions, and sub-task ordering constraints. In addition, EgoTV also provides abstracted task descriptions that contain only partial details about ways to accomplish a task. Each pair of natural language task description and ego-video in EgoTV is accompanied by a label indicating whether the video shows the task being performed as described (positive label) or not (negative label). Consequently, EgoTV requires causal, temporal, and compositional reasoning over the video and language modalities.
Egocentric Agents, Vision-Language Task Tracking and Verification
The objective is to determine, from an agent's egocentric video, whether the agent has correctly executed a task described in natural language.
Each task consists of multiple partially ordered sub-tasks (heat, clean, slice, cool, place, pick), composed using ordering constraints (and, then, before, after) and instantiated on an object of interaction, e.g., heat_then_clean(apple).
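To make this structure concrete, here is a minimal Python sketch of one way such a composed task could be represented; the SubTask/Task classes are illustrative only and are not part of the released EgoTV code.

```python
from dataclasses import dataclass, field

# Hypothetical representation of an EgoTV-style task (not the released API):
# a task is a set of sub-tasks instantiated on an object, plus ordering constraints.

@dataclass
class SubTask:
    action: str   # one of: heat, clean, slice, cool, place, pick
    target: str   # object of interaction, e.g. "apple"

@dataclass
class Task:
    sub_tasks: list[SubTask]
    # ordering constraints as (earlier, later) index pairs; "and" imposes no pair
    orderings: list[tuple[int, int]] = field(default_factory=list)

# heat_then_clean(apple): heat must finish before clean starts
heat_then_clean_apple = Task(
    sub_tasks=[SubTask("heat", "apple"), SubTask("clean", "apple")],
    orderings=[(0, 1)],
)
```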
Metrics:
Task difficulty is measured using Complexity (number of sub-tasks in a task) and Ordering (number of ordering constraints in a task).
Model performance is measured using Accuracy and F1-Score.
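As an illustrative sketch (using scikit-learn; the labels and predictions below are placeholder values, not EgoTV results), the two difficulty measures and the two performance metrics could be computed as follows:

```python
from sklearn.metrics import accuracy_score, f1_score

# Difficulty of a placeholder task: heat_then_clean(apple)
sub_tasks = ["heat(apple)", "clean(apple)"]
ordering_constraints = [("heat(apple)", "clean(apple)")]  # "then": heat before clean
complexity = len(sub_tasks)            # Complexity = number of sub-tasks
ordering = len(ordering_constraints)   # Ordering   = number of ordering constraints

# Verification is binary: 1 = video matches the description, 0 = it does not.
y_true = [1, 0, 1, 1, 0]   # ground-truth labels (placeholder values)
y_pred = [1, 0, 0, 1, 0]   # model predictions  (placeholder values)
print("Accuracy:", accuracy_score(y_true, y_pred))
print("F1-Score:", f1_score(y_true, y_pred))
```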
Generalization Splits:
Novel Tasks (Unseen compositions of seen sub-tasks)
Novel Steps (Unseen affordances)
Novel Scenes (Unseen environments)
Abstraction (High-level task definitions)
Test-set split sizes: Novel Tasks (540), Novel Steps (350), Novel Scenes (1,082), Abstraction (338)
7,673 samples (train set: 5,363; test set: 2,310)
168 hours, 82 tasks, 1,038 task-object combinations
average video length of 84 seconds
4.6 sub-tasks per task on average; each sub-task spans ~14 frames
~2.4 ways on average to verify a task from its natural language description
EgoTV tasks are specified using the Planning Domain Definition Language (PDDL), and viable plans that satisfy the goal conditions and respect the ordering constraints are generated with the Metric-FF planner. The corresponding videos are recorded by executing these plans in the AI2-THOR simulator.
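A rough sketch of that generation pipeline is shown below, assuming Metric-FF's standard -o (domain) / -f (problem) command-line interface; the binary location, file paths, and plan-parsing details are placeholders rather than the released generation code (see the GitHub repo for the actual pipeline).

```python
import subprocess

# Invoke the Metric-FF planner on a PDDL domain/problem pair
# (-o: domain file, -f: problem file). Binary and paths are placeholders.
result = subprocess.run(
    ["./ff", "-o", "domain.pddl", "-f", "heat_then_clean_apple.pddl"],
    capture_output=True, text=True,
)

# Metric-FF prints the plan as numbered steps; this is a rough filter for those
# lines and may need adjustment for the exact output format.
plan = []
for line in result.stdout.splitlines():
    line = line.strip()
    if line.startswith("step"):
        line = line[len("step"):].strip()
    if line and line[0].isdigit() and ":" in line:
        plan.append(line.split(":", 1)[1].strip())

print(plan)

# Each plan step would then be mapped to simulator actions and executed in
# AI2-THOR to record the corresponding egocentric video.
```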
The EgoTV dataset is licensed under CC-BY-NC.
Open-sourced!
2023 International Conference on Computer Vision (ICCV), Paris.
Project Page: https://rishihazra.github.io/EgoTV
Code: https://github.com/facebookresearch/EgoTV
Paper: https://arxiv.org/abs/2303.16975