Computer Vision


Research in Brief: Grounded Human-Object Interaction Hotspots


What the research is:

A new approach that teaches AI how to interact with objects by showing it videos of everyday human behavior. Similar interaction-related research relies on supervised data, meaning manually annotated examples of various actions. This work instead proposes that videos of people interacting with objects can function as weakly supervised data. These videos provide many of the cues the system needs to understand how to interact with objects successfully. This allows the system to extrapolate, improving both its ability to recognize objects and its understanding of how to interact with new objects that it wasn’t trained on.

How it works:

The goal of this research is to teach systems to understand interaction hotspots: the specific parts of a given object that best explain how humans interact with it. The process begins with a video action classifier, which is trained to recognize various classes of actions. The resulting model is then trained to anticipate what an object will look like when used, and finally adapted to generate interaction hotspot maps, both for the objects and actions seen in training and for novel objects. In the experiments, the approach either matched or outperformed related baselines and state-of-the-art systems whose training was significantly more supervised. It also demonstrated that interaction hotspots can provide useful hints about object function. By understanding, for example, that objects that swing open are door-like, the trained model can categorize novel objects that are interacted with in similar ways and imagine new ways to interact with them.
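To make the idea of a hotspot map more concrete, here is a minimal sketch of one generic way to turn an action classifier into a spatial heatmap. This brief does not describe how the maps are actually derived, so the sketch swaps in a common gradient-weighted activation technique (in the style of Grad-CAM) and assumes a classifier made of global average pooling followed by a linear layer. The function name `hotspot_map`, the array shapes, and the random inputs are all hypothetical and chosen only for illustration.

```python
import numpy as np


def hotspot_map(features, class_weights, action):
    """Gradient-weighted activation map for one action class (illustrative only).

    features:      (C, H, W) feature maps for a single image (assumed input).
    class_weights: (num_actions, C) weights of an assumed linear action
                   classifier applied after global average pooling.
    action:        index of the action class to explain.
    """
    _, h, w = features.shape
    # For global average pooling + linear layer, the gradient of the action
    # score with respect to every cell of channel c is class_weights[action, c] / (H * W).
    channel_grads = class_weights[action] / (h * w)
    # Weight each channel by its gradient and sum over channels.
    heat = np.einsum("c,chw->hw", channel_grads, features)
    # Keep only the regions that support the action, then scale to [0, 1].
    heat = np.maximum(heat, 0.0)
    peak = heat.max()
    return heat / peak if peak > 0 else heat


# Hypothetical inputs: 8 feature channels on a 7x7 grid, 3 action classes.
rng = np.random.default_rng(0)
features = rng.random((8, 7, 7))
class_weights = rng.standard_normal((3, 8))
heat = hotspot_map(features, class_weights, action=1)
print(heat.shape)
```

High values in the returned grid mark the image regions that contribute most to the chosen action score, which is the intuition behind a hotspot map. The anticipation step described above is not modeled in this sketch.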

Why it matters:

To bridge the gap between the passive perception of today's image recognition systems and the interactive, embodied capabilities of tomorrow's virtual and robotic assistants, AI must learn not just how the physical world looks, but how it works. In addition to demonstrating how interaction hotspots can encode object function similarity, particularly with limited training, the weakly supervised nature of this research suggests that systems can better understand the relationship between objects and actions through observation. Given how many actions and objects systems must handle to interact fully with people and operate among them, this general strategy shows promise for the larger goal of training AI-powered agents.

Read the full paper:

Grounded Human-Object Interaction Hotspots from Video