Introducing Sphere: Meta AI’s web-scale corpus for better knowledge-intensive NLP

July 11, 2022

“Who won the first Nobel Prize in physics?” If you were stumped by that question 40 years ago, you might have opened up an encyclopedia. Today, you’d just ask the voice assistant on your phone. (It was Wilhelm Conrad Röntgen, by the way, for his discovery of X-rays).

How does the assistant on your phone know the answer? Just like humans did in decades past — by looking it up. In these kinds of question-answering or fact-checking tasks, known collectively as knowledge-intensive natural language processing (KI-NLP), AI models comb through a digital archive for relevant information. The more comprehensive the collection, the more answers it holds.

In the current research landscape, however, KI-NLP architectures face a few key limitations. First, they typically depend on commercial black-box search engines to surface relevant web knowledge. When we use such proprietary search engines, we don't know what we can't see; reader models might miss relevant information because the search engine algorithms rank it too low in the results. Alternatively, retrievers rely on Wikipedia to find relevant knowledge. While Wikipedia is accurate, well formatted, and small enough for the majority of architectures to navigate, it’s also crowdsourced and doesn’t capture all the knowledge available on the web. And its continued growth has made it challenging for editors to double-check every citation or catch inadvertent biases.

At Meta AI, we’re creating new advancements toward more intelligent AI systems that better leverage real-world knowledge. We’ve created the first white-box retrieval solution using the world’s vastest library — the open web — as a universal, uncurated, and unstructured source of knowledge, to solve multiple KI-NLP tasks at once. Our new knowledge source, Sphere, uses open web data rather than traditional, proprietary search engines. That means other AI researchers can see into and control the corpus, so they can experiment with scaling and optimizing different methods to push retrieval technology forward. Sphere contains 134 million documents — split into 906 million passages of 100 tokens each — representing orders of magnitude more data than the knowledge sources considered in current KI-NLP research. Because Sphere can access far more public information than today’s standard models, it could provide useful information that they cannot.
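The split into fixed-size passages can be sketched as follows. This is a minimal illustration, not the actual Sphere preprocessing pipeline: it uses whitespace splitting as a stand-in for whatever tokenizer the real pipeline applies, and the 100-token passage length matches the figure quoted above.

```python
def chunk_into_passages(document: str, passage_len: int = 100) -> list[str]:
    """Split a document into consecutive passages of at most `passage_len`
    tokens, using whitespace tokenization as a simple stand-in."""
    tokens = document.split()
    return [
        " ".join(tokens[i : i + passage_len])
        for i in range(0, len(tokens), passage_len)
    ]

# A synthetic 250-token "document" yields passages of 100, 100, and 50 tokens.
doc = " ".join(f"tok{i}" for i in range(250))
passages = chunk_into_passages(doc)
```

Chunking into fixed-size passages keeps retrieval units uniform, so every index entry costs the same to encode and score regardless of the source document's length.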

Sphere is open-sourced here. We have tested Sphere on the Knowledge Intensive Language Tasks (KILT) benchmark, where it surpassed the state of the art on two of its tasks. We’ve also tested it on a real-world application and developed a model that can successfully review and verify citations in Wikipedia.

Using a web snapshot to build a knowledge corpus

To build Sphere, we first extracted the text from a real web snapshot using CCNet, a pipeline built on Common Crawl that jettisons redundant material and scores pages based on writing quality.
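The kind of cleanup described above can be sketched in miniature. This is a toy illustration of paragraph-level deduplication via hashing, one of the filtering steps such pipelines perform; it is not CCNet's actual implementation, which also scores pages with a language-model-based quality filter.

```python
import hashlib

def dedup_paragraphs(pages: list[str]) -> list[str]:
    """Drop any paragraph whose normalized hash has already been seen,
    removing boilerplate repeated across pages (a toy sketch of
    paragraph-level deduplication, not CCNet itself)."""
    seen: set[str] = set()
    cleaned = []
    for page in pages:
        kept = []
        for para in page.split("\n"):
            norm = para.strip().lower()
            h = hashlib.sha1(norm.encode()).hexdigest()
            if norm and h not in seen:
                seen.add(h)
                kept.append(para)
        cleaned.append("\n".join(kept))
    return cleaned

pages = [
    "All rights reserved.\nUnique article text.",
    "All rights reserved.\nAnother unique passage.",
]
# The repeated boilerplate paragraph survives only in the first page.
cleaned = dedup_paragraphs(pages)
```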

But why not turn to an existing search engine, rather than depend on Common Crawl? Unlike those proprietary black-box systems, Sphere provides a direct and explainable way to leverage the backbone of state-of-the-art NLP research. Sphere opens access to the whole corpus, which we hope will help us identify our retriever's blind spots.

Researchers can examine all the text in Sphere, so we can tinker with architectures that capitalize on the system’s strengths and zero in on specific weaknesses. That insight will help us build universal KI-NLP models that can handle diverse data.

An open corpus also allows us to experiment with new architectures, such as dense retrievers. In dense retrieval, documents and queries are represented as vectors, which can be easily fed to the reader model. In essence, the reader and retriever speak the same language, so it’s fairly straightforward to optimize them to interact with each other. By contrast, search engines were designed for humans to use, so our systems must communicate with them in natural language — raising the potential for errors in translation.
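In schematic form, dense retrieval reduces to nearest-neighbor search over vectors. The sketch below shows only the mechanics: the toy encoder maps each text to a deterministic pseudo-random unit vector and captures no semantics at all, standing in for a trained bi-encoder whose query and passage vectors would actually be close when their meanings match.

```python
import numpy as np

def encode(texts: list[str], dim: int = 8) -> np.ndarray:
    """Toy stand-in for a trained encoder: hash each text to a
    deterministic pseudo-random unit vector. A real dense retriever
    would use a learned neural encoder here."""
    vecs = []
    for t in texts:
        r = np.random.default_rng(abs(hash(t)) % (2**32))
        v = r.standard_normal(dim)
        vecs.append(v / np.linalg.norm(v))
    return np.stack(vecs)

passages = [
    "Röntgen won the first Nobel Prize in physics.",
    "Tug-of-war was once an official Olympic sport.",
    "FAISS searches billion-scale vector collections.",
]
index = encode(passages)          # precomputed passage vectors
query = encode(["first Nobel Prize in physics"])[0]
scores = index @ query            # inner-product similarity, one score per passage
ranking = np.argsort(-scores)     # passages ordered best-first
```

Because both sides live in the same vector space, the retriever's output can be consumed directly by the reader model, with no natural-language round trip in between.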

Overcoming the challenges of billion scale

In one of the most defining developments of our era, the web has opened access to in-depth information on a sweeping range of topics. Need to know the last time tug-of-war was an official Olympic sport? The web has you covered. But that massive scope is both a blessing and a challenge. For our project to succeed, we needed to confront an important question: Are KI-NLP systems ready for web scale?

We found that for a corpus as big as Sphere, the dense index — which stores vector representations of corpus documents to make them easier for the retriever to find — quickly exceeds typical single-server hardware limits for both GPU and RAM. We responded by building distributed-faiss — a wrapper around FAISS, our open source library for similarity search. FAISS lets us search quickly for similar multimedia documents — a task where traditional query search engines fall short — in billion-scale data sets. The new wrapper, distributed-faiss, helps us apportion indices across multiple machines to manage the computational load.
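The sharding pattern at the heart of distributed-faiss can be sketched with plain NumPy: search each shard independently, then merge the per-shard top-k lists into a global top-k. This is a schematic of the idea only, with exact inner-product search standing in for FAISS indices and the shard list standing in for separate machines; it is not the distributed-faiss API.

```python
import numpy as np

def search_shard(shard_vecs: np.ndarray, query: np.ndarray, k: int):
    """Exact inner-product search within one shard.
    Returns (top-k scores, local ids), best-first."""
    scores = shard_vecs @ query
    top = np.argsort(-scores)[:k]
    return scores[top], top

def sharded_search(shards: list[np.ndarray], query: np.ndarray, k: int = 3):
    """Search every shard independently and merge the per-shard top-k
    lists into one global top-k list of (score, shard_id, local_id)."""
    merged = []
    for shard_id, vecs in enumerate(shards):
        scores, ids = search_shard(vecs, query, k)
        merged += [(float(s), shard_id, int(i)) for s, i in zip(scores, ids)]
    merged.sort(key=lambda t: -t[0])
    return merged[:k]

rng = np.random.default_rng(0)
shards = [rng.standard_normal((1000, 16)) for _ in range(4)]  # 4 "machines"
query = rng.standard_normal(16)
hits = sharded_search(shards, query)   # [(score, shard_id, local_id), ...]
```

Because each shard's top-k necessarily contains that shard's best match, merging the per-shard lists recovers the same global top-k a single monolithic index would return, while no one machine ever holds the full index in memory.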

Next-level language models

There’s no guarantee that traditional search engines will continue to allow AI researchers access to build KI-NLP models. As part of our ongoing commitment to help the AI community, we’re releasing Sphere, along with our precomputed sparse and dense indices and distributed-faiss library, to encourage further experimentation in this area. Sphere will help researchers train retrievers to handle a wider range of documents, preparing automatic systems for some of the web’s thorniest challenges — misinformation, noise, and incoherent text. In the real world, these models could muzzle harmful content and, when combined with a well-designed UI, enhance people’s digital literacy and critical thinking skills.

On the web, of course, we can’t be sure that any particular statement is accurate or that a single page will contain all the information we need. Indeed, some parts of the web are laden with toxic content and misinformation. Our next step is to train models to assess the quality of retrieved documents, detect potential contradictions, prioritize more trustworthy sources — and, if no convincing evidence exists, concede that they, like us, can still be stumped. We’re also continuously pushing new scaling advancements and techniques that will help us pave the way toward more ubiquitous search for better, smarter neural networks.

NOTE: Wikimedia and Meta are not partnering on this project. The project is still in the research phase and not being used to automatically update any content on Wikipedia.

Written By

Fabio Petroni

Research Scientist