NLP

CCQA: A New Web-Scale Question Answering Dataset for Model Pre-Training

July 07, 2022

Abstract

With the rise of large-scale pre-trained language models, open-domain question-answering (ODQA) has become an important research topic in NLP. Based on the popular pre-training fine-tuning approach, we posit that an additional in-domain pre-training stage using a large-scale, natural, and diverse question-answering (QA) dataset can be beneficial for ODQA. Consequently, we propose a novel QA dataset based on the Common Crawl project in this paper. Using the readily available schema.org annotation, we extract around 130 million multilingual question-answer pairs, including about 60 million English data-points. With this previously unseen number of natural QA pairs, we pre-train popular language models to show the potential of large-scale in-domain pre-training for the task of question-answering. In our experiments, we find that pre-training question-answering models on our Common Crawl Question Answering dataset (CCQA) achieves promising results in zero-shot, low resource and fine-tuned settings across multiple tasks, models and benchmarks.

Download the Paper

AUTHORS

Written by

Xilun Chen

Armen Aghajanyan

Barlas Oguz

Scott Yih

Sonal Gupta

Patrick Huber

Publisher

NAACL

Related Publications

December 07, 2023

CONVERSATIONAL AI

NLP

Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Davide Testuggine, Madian Khabsa

December 07, 2023

December 06, 2023

NLP

Polar Ducks and Where to Find Them: Enhancing Entity Linking with Duck Typing and Polar Box Embeddings

Mattia Atzeni, Mike Plekhanov, Frederic Dreyer, Nora Kassner, Simone Merello, Louis Martin, Nicola Cancedda

December 06, 2023

December 04, 2023

NLP

PATHFINDER: Guided Search over Multi-Step Reasoning Paths

Olga Golovneva, Sean O'Brien, Ram Pasunuru, Tianlu Wang, Luke Zettlemoyer, Maryam Fazel-Zarandi, Asli Celikyilmaz

December 04, 2023

November 30, 2023

SPEECH & AUDIO

NLP

Efficient Monotonic Multihead Attention

Xutai Ma, Anna Sun, Siqi Ouyang, Hirofumi Inaguma, Paden Tomasello

November 30, 2023

Help Us Pioneer The Future of AI

We share our open source frameworks, tools, libraries, and models for everything from research exploration to large-scale production deployment.