CONVERSATIONAL AI

NLP

Lumos : Empowering Multimodal LLMs with Scene Text Recognition

August 20, 2024

Abstract

We introduce Lumos, the first end-to-end multimodal question-answering system with text understanding capabilities. At the core of Lumos is a Scene Text Recognition (STR) component that extracts text from first person point-of-view images, the output of which is used to augment input to a Multimodal Large Language Model (MM-LLM). While building Lumos, we encountered numerous challenges related to STR quality, overall latency, and model inference. In this paper, we delve into those challenges, and discuss the system architecture, design choices, and modeling techniques employed to overcome these obstacles. We also provide a comprehensive evaluation for each component, showcasing high quality and efficiency.

Download the Paper

AUTHORS

Written by

Ashish Shenoy

Yichao Lu

Srihari Jayakumar

Debojeet Chatterjee

Mohsen Moslehpour

Pierce Chuang

Abhay Harpale

Vikas Bhardwaj

Di Xu (SWE)

Shicong Zhao

Ankit Ramchandani

Luna Dong

Anuj Kumar

Publisher

KDD

Related Publications

December 17, 2024

NLP

FLAME : Factuality-Aware Alignment for Large Language Models

Jack Lin, Luyu Gao, Barlas Oguz, Wenhan Xiong, Jimmy Lin, Scott Yih, Xilun Chen

December 17, 2024

December 12, 2024

NLP

CORE MACHINE LEARNING

Memory Layers at Scale

Vincent-Pierre Berges, Barlas Oguz

December 12, 2024

December 12, 2024

NLP

Byte Latent Transformer: Patches Scale Better Than Tokens

Artidoro Pagnoni, Ram Pasunuru, Pedro Rodriguez, John Nguyen, Benjamin Muller, Margaret Li, Chunting Zhou, Lili Yu, Jason Weston, Luke Zettlemoyer, Gargi Ghosh, Mike Lewis, Ari Holtzman, Srini Iyer

December 12, 2024

December 12, 2024

HUMAN & MACHINE INTELLIGENCE

NLP

Explore Theory-of-Mind: Program-Guided Adversarial Data Generation for Theory of Mind Reasoning

Melanie Sclar, Jane Yu, Maryam Fazel-Zarandi, Yulia Tsvetkov, Yonatan Bisk, Yejin Choi, Asli Celikyilmaz

December 12, 2024

Help Us Pioneer The Future of AI

We share our open source frameworks, tools, libraries, and models for everything from research exploration to large-scale production deployment.