CORE MACHINE LEARNING

An Introduction to Vision-Language Modeling

June 05, 2024

Abstract

Following the recent popularity of Large Language Models (LLMs), several attempts have been made to extend them to the visual domain. From visual assistants that can guide us through unfamiliar environments to generative models that produce images from only a high-level text description, vision-language model (VLM) applications will significantly shape our relationship with technology. However, many challenges still need to be addressed to improve the reliability of these models. While language is discrete, vision evolves in a much higher-dimensional space in which concepts cannot always be easily discretized. To better understand the mechanics behind mapping vision to language, we present this introduction to VLMs, which we hope will help anyone who would like to enter the field. First, we introduce what VLMs are, how they work, and how to train them. Then, we present and discuss approaches to evaluate VLMs. Although this work primarily focuses on mapping images to language, we also discuss extending VLMs to videos.
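The abstract mentions training VLMs to map vision to language. One widely used family of training objectives (popularized by CLIP, and only one of several approaches the paper surveys) is contrastive image-text alignment: matching image/text embedding pairs are pulled together while mismatched pairs are pushed apart. A minimal NumPy sketch of the symmetric InfoNCE loss, with made-up embedding sizes and a hypothetical `clip_contrastive_loss` helper:

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Project embeddings onto the unit sphere so dot products are cosine similarities.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss: matching image/text pairs sit on the diagonal
    of the (N, N) similarity matrix and should dominate their row and column."""
    img = l2_normalize(img_emb)
    txt = l2_normalize(txt_emb)
    logits = img @ txt.T / temperature  # (N, N) scaled cosine similarities
    idx = np.arange(logits.shape[0])

    def xent(l):
        # Row-wise cross-entropy with the diagonal as the correct class.
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[idx, idx].mean()

    # Average the image->text and text->image directions.
    return 0.5 * (xent(logits) + xent(logits.T))

rng = np.random.default_rng(0)
img = rng.normal(size=(4, 8))
loss_random = clip_contrastive_loss(img, rng.normal(size=(4, 8)))
loss_aligned = clip_contrastive_loss(img, img)  # perfectly aligned pairs
```

As expected, perfectly aligned embeddings yield a much lower loss than random pairings; in real training the image and text encoders are optimized jointly to drive this loss down over large batches of captioned images.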

AUTHORS

Anurag Ajay, Alexander C. Li, Suzanne Petryk, Zhiqiu Lin, Anas Mahmoud, Jun Chen, Mazda Moayeri, Aishwarya Agrawal, Florian Bordes, Adrien Bardes, Arjang Talattof, Asli Celikyilmaz, Bargav Jayaraman, Candace Ross, Chuan Guo, Diane Bouchacourt, Ellen Tan, Haider Al-Tahan, Hu Xu, Jonathan Lebensold, Kamalika Chaudhuri, Karen Ullrich, Karthik Padthe, Kate Saenko, Kushal Tirumala, Mark Ibrahim, Megan Richards, Melissa Hall, Oscar Mañas, Pietro Astolfi, Quentin Garrido, Reyhane Askari, Richard Pang, Rim Assouel, Samuel Lavoie, Srihari Jayakumar, Vasu Sharma, Vikas Chandra, Xilun Chen, Yunyang Xiong, Zechun Liu

Publisher

arXiv

Research Topics

Core Machine Learning

Related Publications

November 18, 2025

RESEARCH

CORE MACHINE LEARNING

Souper-Model: How Simple Arithmetic Unlocks State-of-the-Art LLM Performance

Roberta Raileanu*, Alexis Audran-Reiss, Amar Budhiraja*, Anton Protopopov, Bhavul Gauri, Despoina Magka, Gaurav Chaurasia, Michael Slater, Shalini Maiti*, Tatiana Shavrina, Yoram Bachrach (* equal authorship)

October 13, 2025

REINFORCEMENT LEARNING

RESEARCH

SPG: Sandwiched Policy Gradient for Masked Diffusion Language Models

Paria Rashidinejad, Cai Zhou, Tommi Jaakkola, DiJia Su, Bo Liu, Feiyu Chen, Chenyu Wang, Shannon Zejiang Shen, Sid Wang, Siyan Zhao, Song Jiang, Yuandong Tian

September 24, 2025

RESEARCH

NLP

CWM: An Open-Weights LLM for Research on Code Generation with World Models

Chris Cummins, Hugh Leather, Aram Markosyan, Matteo Pagliardini, Tal Remez, Volker Seeker, Marco Selvi, Lingming Zhang, Abhishek Charnalia, Alex Gu, Badr Youbi Idrissi, Christian Keller, Daniel Haziza, David Zhang, Dmitrii Pedchenko, Emily McMilin, Fabian Gloeckle, Felix Kreuk, Francisco Massa, François Fleuret, Gabriel Synnaeve, Gal Cohen, Gallil Maimon, Jacob Kahn, Jade Copet, Jannik Kossen, Jonas Gehring, Jordi Armengol-Estape, Juliette Decugis, Keyur Muzumdar, Kunhao Zheng, Luca Wehrstedt, Maximilian Beck, Michael Hassid, Michel Meyer, Naila Murray, Oren Sultan, Ori Yoran, Pedram Bashiri, Peter O'Hearn, Pierre Chambon, Pierre-Emmanuel Mazaré, Quentin Carbonneaux, Rahul Kindi, Sida Wang, Taco Cohen, Vegard Mella, Yossi Adi, Yuxiang Wei, Zacharias Fisches

August 22, 2025

CORE MACHINE LEARNING

Deep Think with Confidence

Jiawei Zhao, Xuewei Wang, Yichao Fu, Yuandong Tian
