COMPUTER VISION

Learning State-Aware Visual Representations from Audible Interactions

November 10, 2022

Abstract

We propose a self-supervised algorithm to learn representations from egocentric video data. Recently, significant efforts have been made to capture humans interacting with their own environments as they go about their daily activities. As a result, several large egocentric datasets of interaction-rich multi-modal data have emerged. However, learning representations from videos can be challenging. First, given the uncurated nature of long-form continuous videos, learning effective representations requires focusing on moments in time when interactions take place. Second, visual representations of daily activities should be sensitive to changes in the state of the environment, yet current successful multi-modal learning frameworks encourage representation invariance over time. To address these challenges, we leverage audio signals to identify moments of likely interaction, which are conducive to better learning. We also propose a novel self-supervised objective that learns from audible state changes caused by interactions. We validate these contributions extensively on two large-scale egocentric datasets, EPIC-Kitchens-100 and the recently released Ego4D, and show improvements on several downstream tasks, including action recognition, long-term action anticipation, and object state change classification.
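The idea of using audio to find moments of likely interaction can be illustrated with a minimal sketch. It is not the method from the paper, whose details the abstract does not give. It assumes a simple stand-in: a moment is a candidate when the short-time energy of the audio track is a local peak well above the median energy. The function names, the frame length, and the `ratio` threshold are all hypothetical choices made for illustration.

```python
import statistics


def frame_energies(samples, frame_len):
    """Mean squared amplitude of consecutive non-overlapping frames."""
    n_frames = len(samples) // frame_len
    return [
        sum(x * x for x in samples[i * frame_len:(i + 1) * frame_len]) / frame_len
        for i in range(n_frames)
    ]


def candidate_moments(samples, sample_rate, frame_len=400, ratio=4.0):
    """Return frame-center timestamps (seconds) of sharp energy peaks.

    Assumption (not from the paper): a frame is a candidate when its energy
    is a local peak and exceeds `ratio` times the median frame energy.
    """
    energies = frame_energies(samples, frame_len)
    if not energies:
        return []
    # Small floor so that a fully silent track does not give a zero threshold.
    threshold = ratio * max(statistics.median(energies), 1e-12)
    moments = []
    for i, energy in enumerate(energies):
        prev_e = energies[i - 1] if i > 0 else float("-inf")
        next_e = energies[i + 1] if i + 1 < len(energies) else float("-inf")
        # Strict comparison on one side avoids flagging a plateau twice.
        if energy > threshold and energy >= prev_e and energy > next_e:
            moments.append((i + 0.5) * frame_len / sample_rate)
    return moments
```

Calling `candidate_moments(samples, 16000)` on a mono waveform sampled at 16 kHz returns timestamps around which video clips could be sampled for training. A real pipeline would more likely work on spectrogram features than on raw energy.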


AUTHORS

Written by

Unnat Jain

Abhinav Gupta

Himangi Mittal

Pedro Morgado

Publisher

NeurIPS

Research Topics

Computer Vision

Related Publications

February 13, 2024

GRAPHICS

COMPUTER VISION

IM-3D: Iterative Multiview Diffusion and Reconstruction for High-Quality 3D Generation

Luke Melas-Kyriazi, Iro Laina, Christian Rupprecht, Natalia Neverova, Andrea Vedaldi, Oran Gafni, Filippos Kokkinos


January 25, 2024

COMPUTER VISION

LRR: Language-Driven Resamplable Continuous Representation against Adversarial Tracking Attacks

Felix Xu, Di Lin, Jianjun Zhao, Jianlang Chen, Lei Ma, Qing Guo, Wei Feng, Xuhong Ren


November 10, 2023

COMPUTER VISION

EgoDistill: Egocentric Head Motion Distillation for Efficient Video Understanding

Shuhan Tan, Tushar Nagarajan, Kristen Grauman


October 29, 2023

COMPUTER VISION

ALA: Naturalness-aware Adversarial Lightness Attack

Felix Xu, Geguang Pu, Jiayi Zhu, Jincao Feng, Liangru Sun, Qing Guo, Yang Liu, Yihao Huang

