The EgoObjects dataset is designed to push the frontier of first-person and open-world object understanding for improving metaverse AR products.
To further push the limits of egocentric perception, we create the first large-scale dataset focused on object detection in egocentric video, featuring diverse viewpoints as well as varying scales, backgrounds, and lighting conditions. While most existing comparable datasets are either not object-centric or not large-scale, our initial release will cover over 12,000 videos (40+ hours) across 200 main object categories in over 25 countries. Besides the main objects, the videos also capture various surrounding objects in the background, bringing the total number of object categories to as many as 600.
Data collection is conducted with a wide range of egocentric recording devices (Ray-Ban Stories, Snap Spectacles, and mobile devices) in realistic household scenarios. EgoObjects also features an array of rich annotations, such as bounding boxes, category labels, and instance IDs, as well as rich meta information, such as background description, lighting condition, and location.
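The concrete release format is not described here; purely as a sketch, assuming a COCO-style JSON layout with illustrative (not official) field names, a single annotated frame might look like this:

```python
# Minimal sketch of one annotated frame, assuming a COCO-style JSON layout.
# All field names and values are illustrative assumptions, not the official
# EgoObjects schema.
frame_record = {
    "image": {
        "id": 101,
        "file_name": "video_0001/frame_000045.jpg",
        "width": 1920,
        "height": 1080,
        # Self-provided meta information from the collection vendor.
        "main_category": "coffee mug",
        "background_description": "cluttered kitchen counter",
        "lighting": "dim indoor lighting",
        "location": "kitchen",
    },
    "annotations": [
        {
            "image_id": 101,
            "bbox": [412.0, 230.5, 180.0, 195.0],  # [x, y, width, height] in pixels
            "category": "coffee mug",              # category-level label
            "instance_id": "mug_0007",             # instance identity tracked across frames
        },
        {
            "image_id": 101,
            "bbox": [820.0, 140.0, 95.0, 310.0],
            "category": "bottle",
            "instance_id": "bottle_0012",
        },
    ],
}
```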
The EgoObjects challenge version will be used for the continual learning challenge at the CLVision workshop at CVPR 2022, with three tracks: continual instance-level object classification, continual category-level object detection, and continual instance-level object detection. These tracks are designed to advance object understanding from the egocentric perspective, a fundamental building block for AR applications.
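As a rough illustration of the continual category-level detection setting only (not the official challenge protocol), the sketch below partitions annotated frames into a sequence of experiences, each introducing a new batch of categories; `frame_categories`, `build_category_experiences`, and the batch size are assumed names and values for this example:

```python
from collections import defaultdict
from typing import Dict, List


def build_category_experiences(
    frame_categories: Dict[int, List[str]],
    categories_per_experience: int = 20,
) -> List[List[int]]:
    """Group frame IDs into sequential 'experiences', each introducing a
    disjoint batch of categories. This is only a sketch of a continual
    detection setup, not the official CLVision protocol.
    """
    # Map each category to the set of frames in which it appears.
    frames_by_category = defaultdict(set)
    for frame_id, cats in frame_categories.items():
        for cat in cats:
            frames_by_category[cat].add(frame_id)

    ordered_categories = sorted(frames_by_category)
    experiences = []
    for start in range(0, len(ordered_categories), categories_per_experience):
        batch = ordered_categories[start : start + categories_per_experience]
        frame_ids = sorted(set().union(*(frames_by_category[c] for c in batch)))
        experiences.append(frame_ids)
    return experiences


# Toy usage: two frames, two categories, one category per experience.
# exps -> [[2], [1]] because categories are visited in sorted order ("bottle", "mug").
exps = build_category_experiences({1: ["mug"], 2: ["bottle"]}, categories_per_experience=1)
```

A detector would then visit the experiences in order, training on one at a time and being evaluated on all categories seen so far.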
Computer vision, machine learning
Research on continual learning for instance- and category-level object classification and detection. Open-sourced for the CVPR 2022 CLVision workshop.
Image (jpg)
Training, testing
Total number of images: ~100k
Image frame rate: 1 FPS
Number of main object categories: 200
Number of all object categories: up to 600
Labels
Main object category (self-provided): ~100k
Background description (self-provided): ~100k
Location (self-provided): ~100k
2D bounding boxes (human labeled): ~250k
Category labels (human labeled): ~250k
Instance IDs (human labeled): ~250k
Frames from video recordings of indoor objects taken by egocentric cameras
Data covers indoor objects only; no person data is included
Open access
Data sources
Vendor data collection efforts
Data selection
All images were opted in by users for use in algorithm training and benchmarking
Frames sampled at 1 FPS
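As an illustrative sketch only (not the pipeline used to produce the dataset), sampling frames at roughly 1 FPS from a source video could be done with OpenCV:

```python
import cv2  # pip install opencv-python
from pathlib import Path


def sample_frames_at_1fps(video_path: str, out_dir: str) -> int:
    """Write roughly one frame per second of video to out_dir as JPEGs.
    Illustrative only; not the script used to build EgoObjects."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0  # fall back if FPS metadata is missing
    step = max(int(round(fps)), 1)           # keep every `step`-th frame, i.e. ~1 FPS

    saved, index = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            cv2.imwrite(str(Path(out_dir) / f"frame_{saved:06d}.jpg"), frame)
            saved += 1
        index += 1
    cap.release()
    return saved
```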
Geographic distribution
Over 25 countries, including US, ZA, NG, IN, VN, FR, and DE
Human labels
Labeling procedure - Human
Vendors provided meta information, including the main object category, background description, and location. Annotators then labeled 2D bounding boxes with category labels and instance IDs.