Introducing long-form question answering


To help advance question answering (QA) and create smarter assistants, Facebook AI is sharing the first large-scale dataset, code, and baseline models for long-form QA, which requires machines to provide long, complex answers — something that existing algorithms have not been challenged to do before. Current systems are focused on trivia-type questions, like whether jellyfish have a brain. Our dataset goes further by requiring machines to elaborate with in-depth answers to open-ended questions, such as “How do jellyfish function without a brain?” Furthermore, our dataset provides researchers with hundreds of thousands of examples to advance AI models that can synthesize information from multiple sources and provide explanations to complex questions across a wide range of topics.

For truly intelligent assistants that can help us with myriad daily tasks, AI should be able to answer a wide variety of questions from people beyond straightforward, factual queries such as “Which artist sings this song?” Most existing QA tasks are constrained — both to specific knowledge domains and to answers of a single word or phrase from the input passage. They require identifying a simple fact in a single web document, which is then presented as the answer, but existing QA systems can’t offer rich explanations the way people do.

Free-form responses require not only finding relevant information on the web but also synthesizing this information into multiple sentences. To make progress in long-form question answering, researchers need a large, diverse dataset of complex how- and why-type questions with paragraph-length answers. Our new long-form QA dataset challenges existing algorithms because it requires processing many web documents comprising hundreds of thousands of words, identifying the relevant information in those documents, and writing a long-form response to an often open-ended question. Previous work has proposed datasets with some of these components, but not all of them together.

Building a new challenge for the QA community

We’ve created a large-scale, high-quality dataset, together with web documents, as well as two pretrained models. To build the dataset, we leveraged a public subreddit titled “Explain Like I’m Five” (ELI5), in which an online community answers questions with responses that 5-year-olds can comprehend. The dataset comprises 270K threads of diverse, open-ended questions that require multi-sentence answers. We chose this subreddit because it does not heavily rely on preexisting knowledge of the world and it contains relatively simple language to help models provide answers seamlessly. To aid in answering questions, we provide pointers to webpages that have relevant information.

Current QA challenges

  • Q: What’s the nearest restaurant?

  • Q: What is the largest lake in the world?

  • Q: What time is it in Tokyo right now?

Long-form QA challenges

  • Q: Why are some restaurants better than others if they serve basically the same food?

  • Q: What are the differences between bodies of water like lakes, rivers, and seas?

  • Q: Why do we feel more jet lagged when traveling east?

Normally, when such datasets are built, questions are manually derived from a given paragraph, which means that the answers are readily apparent in that paragraph. We adopt a different approach by taking the questions found on the subreddit as a starting point, then gathering relevant documents. We can then train our AI models to answer the questions that people actually ask. To obtain those documents, we built a search engine for our database of 270K questions so that each question is mapped to 100 candidate web documents. This approach forces our model to read multiple sources in order to develop a complete explanation for each question.

QA models for ELI5 mimic what many people do when they’re asked a question: If they don’t know the answer, they’ll likely search the web to learn about the topic, read a few of the results, and then provide the answer. ELI5 combines the challenges of synthesizing information from multiple sources, answering questions, and generating text into a single real-world task, making it more realistic and more difficult than prior QA datasets.

Training extractive and abstractive models

With this open source dataset, we’re also introducing models for two directions of further research: extractive models, which produce answers that are copied word for word from the supporting documents, and abstractive models, which can rewrite the information in the supporting documents as needed. Both are strong, general approaches that researchers can use as a baseline to further explore long-form QA research.

Extractive: Multisentence model

We train a generalized version of the bidirectional attention flow (BiDAF) model to identify relevant sentences from the web search information and paste these sentences together to form a relevant answer. Previously, BiDAF models have been trained to identify a single extractive span answer (a phrase or a few words) from the input text. Because our dataset comprises complex questions, we extended this model to select multiple spans to obtain more content.

In testing, we found that our new BiDAF model is stronger than the baseline, which is based on term frequency-inverse document frequency (TF-IDF).
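The difference between single-span and multi-span extraction can be sketched as follows. This is only an illustration of the "select several sentences and paste them together" step. The per-sentence scores stand in for the output of a trained model, and the function name `select_sentences` and the `max_sentences` parameter are assumptions, not part of the released code.

```python
def select_sentences(sentences, scores, max_sentences=3):
    """Pick the highest-scoring sentences and join them in document order."""
    if len(sentences) != len(scores):
        raise ValueError("need exactly one score per sentence")
    # Indices of the top-scoring sentences (scores are assumed to come from a model).
    top = sorted(range(len(sentences)),
                 key=lambda i: scores[i],
                 reverse=True)[:max_sentences]
    # Restore the original order so the pasted answer reads naturally.
    return " ".join(sentences[i] for i in sorted(top))


sentences = [
    "Jellyfish are found in every ocean.",
    "They have no centralized brain.",
    "Some species glow in the dark.",
    "A nerve net spread through the body coordinates movement.",
]
# Hypothetical relevance scores for "How do jellyfish function without a brain?"
scores = [0.10, 0.90, 0.05, 0.80]
answer = select_sentences(sentences, scores, max_sentences=2)
```

A single-span model would return only the best-scoring piece of text. Selecting several sentences yields more content for a complex question.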

Abstractive: Seq2seq multitask model

We use a sequence-to-sequence (seq2seq) approach for abstractive modeling to synthesize information from various web sources to write a paragraph-length answer. Standard seq2seq models receive a training signal only from predicting the answer, whereas a language model approach would be trained to predict the question, web source, and answer.

To improve performance, we train seq2seq models with multitasking to combine the benefits of language modeling with seq2seq. We do this by tackling multiple tasks during training, then applying the resulting model to the standard QA task of reading the question and documents and writing the answer. It turns out that this multitask seq2seq approach outperforms standard language modeling and seq2seq techniques.
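One way to picture multitask training is as several (source, target) pairs built from the same question, document, and answer. The sketch below assumes a task set of one QA task plus three language-modeling tasks, marked with made-up tags such as `<qa>` and `<lm_answer>`. The tags, the `<sep>` separator, and the function name `build_multitask_examples` are all hypothetical, and the actual tasks used in the models may differ.

```python
def build_multitask_examples(question, document, answer):
    """Return (task_tag, source, target) triples for one training instance."""
    # Standard seq2seq task: read the question and document, write the answer.
    qa_source = f"{question} <sep> {document}"
    return [
        ("<qa>", qa_source, answer),
        # Language-modeling style tasks: predict each piece of text with no input,
        # so the model also receives a training signal from the question and document.
        ("<lm_question>", "", question),
        ("<lm_document>", "", document),
        ("<lm_answer>", "", answer),
    ]


examples = build_multitask_examples(
    question="Why do we feel more jet lagged when traveling east?",
    document="Traveling east shortens the day, which is harder for the body clock.",
    answer="Flying east forces your body clock to shift earlier, which it resists.",
)
# At test time, only the "<qa>" formulation would be used.
qa_task = [ex for ex in examples if ex[0] == "<qa>"]
```

In this sketch, every training instance contributes to all tasks, while inference uses only the QA formulation, which matches the train-on-many, apply-to-one pattern described above.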

Further exploration

Today, assistants are immediately available to help us answer our questions through mobile devices. But current question answering technology is focused on short, factual answers. Think of responses to “Which artist sings this song?” If we want assistants that are useful all around, then we need to have AI that can respond well to a variety of how- and why-type questions. Think “What’s the meaning behind these lyrics?” This requires synthesizing information from multiple sources and being able to provide long explanations. The ELI5 dataset and the accompanying baseline models help us make progress toward this goal.

We provide both extractive and abstractive models to produce on-topic answers and demonstrate the ability of our models to read through relatively large quantities of noisy web information. Our results show there’s a clear gap between answers that models write and answers that people write — which creates more opportunities for future research in long-form QA.

In addition to dialog research, we’re also focused on improving language understanding, exemplified through recent research on QA tasks that require difficult reasoning and embodied QA, which combines perception, navigation, and communication. In the long term, these models could collectively affect the way people access information, particularly in the development of more intelligent assistants.

You can find examples, models, and the dataset here.

Read the full paper here and see this work presented at ACL 2019.
