NLP

COMPUTER VISION

Large-scale Pretraining for Visual Dialog: A Simple State-of-the-Art Baseline

July 15, 2020

Abstract

Prior work in visual dialog has focused on training deep neural models on VisDial [1] in isolation. Instead, we present an approach to leverage pretraining on related vision-language datasets before transferring to visual dialog. We adapt the recently proposed ViLBERT model [2] for multi-turn visually-grounded conversations. Our model is pretrained on the Conceptual Captions [3] and Visual Question Answering [4] datasets, and finetuned on VisDial. Our best single model outperforms prior published work by >1% absolute on NDCG and MRR. Next, we find that additional finetuning using “dense” annotations in VisDial leads to even higher NDCG – more than 10% over our base model – but hurts MRR – more than 17% below our base model! This highlights a trade-off between the two primary metrics – NDCG and MRR – which we find is due to dense annotations not correlating well with the original ground-truth answers to questions.
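The NDCG/MRR trade-off described in the abstract can be illustrated with a toy example. The sketch below is not the authors' evaluation code. It assumes MRR is driven by the rank of the single ground-truth answer, while NDCG is computed from dense relevance scores over the top-k ranked candidates, with k taken as the number of candidates with nonzero relevance. The function names, the four-candidate setup, and the relevance values are all hypothetical.

```python
import math


def reciprocal_rank(ranking, gt_index):
    """Return 1 / (1-based position of the ground-truth candidate)."""
    return 1.0 / (ranking.index(gt_index) + 1)


def ndcg(ranking, relevance):
    """NDCG over the top-k ranked candidates.

    Assumption: k is the number of candidates with nonzero dense relevance.
    """
    k = sum(1 for r in relevance if r > 0)
    if k == 0:
        return 0.0
    # Discounted gain of the candidates the model actually ranked in the top k.
    dcg = sum(relevance[c] / math.log2(pos + 2)
              for pos, c in enumerate(ranking[:k]))
    # Best achievable discounted gain: relevance scores in descending order.
    ideal = sorted(relevance, reverse=True)[:k]
    idcg = sum(r / math.log2(pos + 2) for pos, r in enumerate(ideal))
    return dcg / idcg


# Hypothetical question with 4 candidate answers. Candidate 0 is the original
# ground-truth answer, but dense annotators rated candidates 1 and 2 higher.
relevance = [0.5, 1.0, 1.0, 0.0]
gt_index = 0

ranking_a = [0, 1, 2, 3]  # ranks the ground-truth answer first
ranking_b = [1, 2, 0, 3]  # ranks the densely relevant answers first

for name, ranking in [("A", ranking_a), ("B", ranking_b)]:
    print(name,
          "RR =", round(reciprocal_rank(ranking, gt_index), 3),
          "NDCG =", round(ndcg(ranking, relevance), 3))
```

In this toy setup, ranking A has the better reciprocal rank and ranking B has the better NDCG. That is the kind of disagreement that can arise when dense annotations do not line up with the original ground-truth answer.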

AUTHORS

Written by

Devi Parikh

Abhishek Das

Dhruv Batra

Vishvak Murahari

Publisher

ECCV
