Takeaways:
Imagine an embodied AI agent that acts as the brain of a home robot or a stylish pair of smart glasses. Such an agent needs to leverage sensory modalities like vision to understand its surroundings and be capable of communicating in clear, everyday language to effectively assist people. This is akin to building a “world model”—an agent’s internal representation of the external world that is queryable through language. It’s a long-term vision and a daunting research challenge—one that Meta is actively exploring.
Today, we’re introducing the Open-Vocabulary Embodied Question Answering (OpenEQA) framework—a new benchmark to measure an AI agent’s understanding of its environment by probing it with open-vocabulary questions. This is similar to how we might assess a human’s understanding of a concept, by asking them questions and evaluating their answers. OpenEQA contains two tasks: (1) episodic memory EQA, in which an embodied AI agent answers questions based on its recollection of past experiences, and (2) active EQA, in which the agent must take action within the environment to gather necessary information and answer questions.
EQA has direct applications too, and even a basic version of it can simplify your everyday life. For example, let’s say you’re getting ready to leave the house and can’t find your office badge. You could ask your smart glasses where you left it, and the agent might respond that the badge is on the dining table by leveraging its episodic memory. Or if you’re hungry on your way home, you could ask your home robot whether there’s any fruit left. Based on its active exploration of the environment, it might respond that there are ripe bananas in the fruit basket. Watch the video at the top of this post to see EQA in action.
Sounds simple enough, right? After all, LLMs have excelled in tasks many people find challenging, like passing the SAT or bar exams. But the reality is that even today’s most advanced models struggle to match human performance when it comes to EQA, another manifestation of Moravec’s paradox. That’s why we’re also releasing the OpenEQA benchmark, so researchers can test their own models and see how they stack up against humans.
Why EQA? From “word models” to “world models”
We’ve seen exciting developments in the space of large language models (LLMs), which seem to have captured a basic linguistic understanding of the world. LLMs can answer all kinds of questions based on their historical knowledge, but they have no idea what is currently going on in the world around them. By enhancing LLMs with the ability to “see” the world and situating them in a user’s smart glasses or on a home robot, we can open up new applications and add value to people’s lives.
It’s an exciting problem statement because, as Jitendra Malik puts it, it showcases the difference between building world models and word models. In other words, rather than simply predicting the next token in a string, an embodied AI agent that excels at EQA would show that it’s grounded in an understanding of the physical world. Such world models are an important step toward our vision of artificial general intelligence (AGI).
To that end, EQA is a tool that probes whether an AI agent really understands what is going on in the world around it. After all, when we want to determine how well a human understands a concept, we ask them questions and form an assessment based on their answers. We can do the same with embodied AI agents.
OpenEQA: A novel benchmark for Embodied AI
OpenEQA is the first open-vocabulary benchmark for EQA, which we believe will help researchers track future progress in multimodal learning and scene understanding. The benchmark features over 1,600 non-templated pairs of questions and answers from human annotators that are representative of real-world use cases, as well as pointers to more than 180 videos and scans of physical environments. Our question-and-answer pairs were validated by different human annotators to ensure that the questions are answerable and the answers provided are correct.
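To give a concrete sense of what working with the benchmark looks like, here is a minimal sketch of loading and iterating over an OpenEQA-style question file in Python. The file name and field names ("question", "answer", "episode_history") are illustrative assumptions for this sketch, not the exact schema of the released dataset.

```python
import json
from collections import Counter

# Load an OpenEQA-style annotation file: a list of human-written
# question-answer entries, each tied to an episode (a video or scan).
# Note: the file name and keys below are assumptions for illustration.
with open("open-eqa.json") as f:
    dataset = json.load(f)

print(f"{len(dataset)} question-answer pairs")

# Count how many questions reference each episode history.
episodes = Counter(item.get("episode_history", "unknown") for item in dataset)
print(f"{len(episodes)} distinct episodes")

# Inspect a single entry.
example = dataset[0]
print("Q:", example["question"])
print("A:", example["answer"])
```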
OpenEQA also comes equipped with LLM-Match, an automatic evaluation metric for scoring open-vocabulary answers. In fact, through blind user studies, we found that LLM-Match correlates with human judgments about as well as two humans correlate with each other.
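To illustrate the idea behind an LLM-based matcher, the sketch below asks a judge LLM to rate how well a candidate answer matches the human-annotated answer, then maps the ratings onto a percentage. The `call_llm` helper, the prompt wording, the 1-5 scale, and the normalization are assumptions made for this sketch; the exact protocol used by LLM-Match is described in the OpenEQA paper.

```python
from typing import Callable

# Hypothetical helper type: wraps whatever LLM endpoint serves as the judge.
# It takes a prompt string and returns the judge's text reply.
LLMFn = Callable[[str], str]

JUDGE_PROMPT = """You are grading an answer to a question about a household scene.
Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}
Rate how well the candidate matches the reference, from 1 (wrong) to 5 (equivalent).
Reply with a single integer."""


def llm_match_score(question: str, reference: str, candidate: str, call_llm: LLMFn) -> int:
    """Ask the judge LLM for a 1-5 rating of the candidate answer."""
    reply = call_llm(
        JUDGE_PROMPT.format(question=question, reference=reference, candidate=candidate)
    )
    return int(reply.strip())


def aggregate(scores: list[int]) -> float:
    """Map 1-5 ratings to a 0-100% score (1 -> 0%, 5 -> 100%) and average them."""
    return 100.0 * sum((s - 1) / 4 for s in scores) / len(scores)
```

Scoring free-form answers this way is what lets the benchmark accept open-vocabulary responses like “on the dining table” without relying on templated answer matching.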
We used OpenEQA to benchmark several state-of-the-art vision+language foundation models (VLMs) and found a significant gap between even the best-performing model (GPT-4V at 48.5%) and human performance (85.9%). Of particular interest, for questions that require spatial understanding, even the best VLMs are nearly “blind”: they perform little better than text-only models, indicating that models with access to visual information aren’t substantially benefiting from it and instead fall back on priors about the world captured in text to answer visual questions. As an example, for the question “I’m sitting on the living room couch watching TV. Which room is directly behind me?”, the models guess different rooms essentially at random, without benefiting from the visual episodic memory that should provide an understanding of the space. This suggests that additional improvements on both the perception and reasoning fronts are needed before embodied AI agents powered by such models are ready for primetime.
OpenEQA combines challenging open-vocabulary questions with the ability to answer in natural language. The result is a straightforward benchmark that requires a strong understanding of the environment and poses a considerable challenge to current foundation models. We hope this work motivates additional research into helping AI understand and communicate about the world it sees.
At FAIR, we’re working to build world models capable of performing well on OpenEQA, and we welcome others to join us in that effort.