Research

Computer Vision

Detecting and Recognizing Human-Object Interactions

June 18, 2018

Abstract

To understand the visual world, a machine must not only recognize individual object instances but also how they interact. Humans are often at the center of such interactions and detecting human-object interactions is an important practical and scientific problem. In this paper, we address the task of detecting <human, verb, object> triplets in challenging everyday photos. We propose a novel model that is driven by a human-centric approach. Our hypothesis is that the appearance of a person – their pose, clothing, action – is a powerful cue for localizing the objects they are interacting with. To exploit this cue, our model learns to predict an action-specific density over target object locations based on the appearance of a detected person. Our model also jointly learns to detect people and objects, and by fusing these predictions it efficiently infers interaction triplets in a clean, jointly trained end-to-end system we call InteractNet. We validate our approach on the recently introduced Verbs in COCO (V-COCO) and HICO-DET datasets, where we show quantitatively compelling results.

Download the Paper

Related Publications

October 18, 2025

NLP

Controlling Multimodal LLMs via Reward-guided Decoding

Oscar Mañas, Pierluca D'Oro, Koustuv Sinha, Adriana Romero Soriano, Michal Drozdzal, Aishwarya Agrawal

October 18, 2025

September 23, 2025

NLP

MetaEmbed: Scaling Multimodal Retrieval at Test-Time with Flexible Late Interactions

Zilin Xiao, Qi Ma, Mengting Gu, Jason Chen, Xintao Chen, Vicente Ordonez, Vijai Mohan

September 23, 2025

August 14, 2025

Computer Vision

DINOv3

Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, Francisco Massa, Daniel Haziza, Luca Wehrstedt, Jianyuan Wang, Timothée Darcet, Theo Moutakanni, Leonel Sentana, Claire Roberts, Andrea Vedaldi, Jamie Tolan, John Brandt, Camille Couprie, Julien Mairal, Herve Jegou, Patrick Labatut, Piotr Bojanowski

August 14, 2025

August 13, 2025

Human & Machine Intelligence

Disentangling the Factors of Convergence between Brains and Computer Vision Models

Josephine Raugel, Marc Szafraniec, Huy V. Vo, Camille Couprie, Patrick Labatut, Piotr Bojanowski, Valentin Wyart, Jean Remi King

August 13, 2025

June 11, 2019

Computer Vision

ELF OpenGo: An Analysis and Open Reimplementation of AlphaZero | Facebook AI Research

Yuandong Tian, Jerry Ma, Qucheng Gong, Shubho Sengupta, Zhuoyuan Chen, James Pinkerton, Larry Zitnick

June 11, 2019

April 30, 2018

NLP

Computer Vision

Mastering the Dungeon: Grounded Language Learning by Mechanical Turker Descent | Facebook AI Research

Zhilin Yang, Saizheng Zhang, Jack Urbanek, Will Feng, Alexander H. Miller, Arthur Szlam, Douwe Kiela, Jason Weston

April 30, 2018

October 10, 2016

Speech & Audio

Computer Vision

Polysemous Codes | Facebook AI Research

Matthijs Douze, Hervé Jégou, Florent Perronnin

October 10, 2016

June 18, 2018

Speech & Audio

Computer Vision

Low-shot learning with large-scale diffusion | Facebook AI Research

Matthijs Douze, Arthur Szlam, Bharath Hariharan, Hervé Jégou

June 18, 2018

Help Us Pioneer The Future of AI

We share our open source frameworks, tools, libraries, and models for everything from research exploration to large-scale production deployment.