RESEARCH

NLP

Align2Ground: Weakly Supervised Phrase Grounding Guided by Image-Caption Alignment

October 31, 2019

Abstract

We address the problem of grounding free-form textual phrases by using weak supervision from image-caption pairs. We propose a novel end-to-end model that uses caption-to-image retrieval as a `downstream' task to guide the process of phrase localization. Our method, as a first step, infers the latent correspondences between regions-of-interest (RoIs) and phrases in the caption and creates a discriminative image representation using these matched RoIs. In a subsequent step, this (learned) representation is aligned with the caption. Our key contribution lies in building this `caption-conditioned' image encoding which tightly couples both the tasks and allows the weak supervision to effectively guide visual grounding. We provide an extensive empirical and qualitative analysis to investigate the different components of our proposed model and compare it with competitive baselines. For phrase localization, we report an improvement of 4.9% (absolute) over the prior state-of-the-art on the VisualGenome dataset. We also report results that are at par with the state-of-the-art on the downstream caption-to-image retrieval task on COCO and Flickr30k datasets.

Download the Paper

AUTHORS

Written by

Devi Parikh

Ajay Divakaran

Anirban Roy

Karan Sikka

Karuna Ahuja

Samyak Datta

Publisher

ICCV

Related Publications

June 05, 2026

CONVERSATIONAL AI

RANKING AND RECOMMENDATIONS

Superintelligent Retrieval Agent: The Next Frontier of Agentic Retrieval

Zeyu Yang, Qi Ma, Jason Chen, Anshumali Shrivastava

June 05, 2026

May 26, 2026

HUMAN & MACHINE INTELLIGENCE

THEORY

Misalignment Between Backpropagation and the Hierarchy of Brain Responses to Images

Josephine Raugel, Max Seitzer, Marc Szafraniec, Huy V. Vo, Jérémy Rapin, Patrick Labatut, Piotr Bojanowski, Valentin Wyart, Jean Remi King

May 26, 2026

May 20, 2026

HUMAN & MACHINE INTELLIGENCE

RESEARCH

EgoBabyVLM: Benchmarking Cross-Modal Learning from Naturalistic Egocentric Video Data

Dongyan Lin, Phillip Rust, Angel Villar Corrales, Alvin W. M. Tan, Mahi Luthra, Charles-Eric Saint-James, Rashel Moritz, Sheila Krogh-Jespersen, Vanessa Stark, Surya Parimi, Jiayi Shen, Youssef Benchekroun, Yosuke Higuchi, Martin Gleize, Tom Fizycki, Nicolas Hamilakis, Manel Khentout, Sho Tsuji, Balázs Kégl, Juan Pino, Michael C. Frank, Emmanuel Dupoux

May 20, 2026

May 18, 2026

CONVERSATIONAL AI

RESEARCH

GIM: Evaluating models via tasks that integrate multiple cognitive domains

Rohit Patel, Alexandre Rezende, Steven McClain

May 18, 2026

Help Us Pioneer The Future of AI

We share our open source frameworks, tools, libraries, and models for everything from research exploration to large-scale production deployment.