March 17, 2022
Information retrieval (IR), the task of searching for and accessing relevant knowledge, is arguably among the most defining challenges of the information age. People use it every day to find books in a digital library, shoes from an online retailer, songs in a streaming music service, and much more.
To excel at this task, however, an IR system must be able to parse the intricacies and subtleties of human language. If you typed in “blue shoes,” would teal suffice? If you asked about George Clinton’s influence on hip-hop, should the system give you articles about the funk musician or the 19th-century U.S. vice president?
Neural models would be the natural solution because of their ability to understand language deeply, but they are not widespread in IR due to computational constraints and scale. Just as people in knowledge-intensive jobs are commonly required to access knowledge on the web, neural networks must search larger-scale knowledge sources in an efficient way. Recently, researchers have made great strides in improving the accuracy and efficiency of pretrained language models.
Today, we’re sharing cutting-edge dense retrieval models that generalize well to tasks outside the training set while retaining efficiency and scalability comparable to those of traditional text matching systems. These new techniques will help pave the way for ubiquitous neural information retrieval, which will improve search as we currently use it and enable both new AI agents that can responsibly harness the full knowledge on the internet and more intelligent fact-checking systems to combat misinformation.
Traditional IR techniques, such as TF-IDF and BM25, calculate how many words and phrases in candidate documents match those in the query. These functions assign more importance to uncommon terms, which likely have a greater influence on the meaning of the sentence. (Common words, such as “the” or “a,” are usually less informative.)
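To make the term-weighting idea concrete, here is a minimal sketch of a BM25-style scorer. It is an illustration rather than the implementation used in any production system: the function name `bm25_scores`, the whitespace tokenizer, the sample documents, and the parameter values `k1=1.5` and `b=0.75` are all assumptions chosen for the example.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with a basic BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: how many documents contain each term.
    doc_freq = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        term_freq = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue  # only exact term matches contribute
            df = doc_freq[term]
            # Uncommon terms receive a larger inverse document frequency.
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            tf = term_freq[term]
            length_norm = k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores

# Hypothetical mini-corpus for illustration.
docs = [
    "the blue shoes are on sale",
    "the teal sneakers are on sale",
    "the history of the funk musician",
]
scores = bm25_scores("blue shoes", docs)
```

In this toy corpus only the first document shares terms with the query "blue shoes", so the teal sneakers receive a score of zero, and a query for "the" scores lower than a query for "blue" because "the" appears in every document.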
These heuristics can scale extremely well with distributed inverted indices, which store lists of the documents each word appears in, and have been remarkably effective. However, these techniques don’t comprehend the text’s actual meaning, and so they are oblivious to the significance of synonyms, paraphrases, inversions, and other nuances. They are also rigidly defined and cannot adapt to different definitions of relevance in other contexts.
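The inverted index described above can be sketched in a few lines. This is a single-machine toy under simplifying assumptions (whitespace tokenization, in-memory sets, hypothetical helper names `build_inverted_index` and `candidates`), not a distributed implementation.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of document ids it appears in."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def candidates(index, query):
    """Return ids of documents containing at least one query term."""
    ids = set()
    for term in query.lower().split():
        ids |= index.get(term, set())
    return sorted(ids)

# Hypothetical documents for illustration.
docs = ["blue shoes on sale", "teal sneakers on sale"]
index = build_inverted_index(docs)
shoe_hits = candidates(index, "shoes")
footwear_hits = candidates(index, "footwear")
```

Looking up "shoes" touches only the posting list for that one term, which is why the approach scales so well. A query for "footwear" returns nothing here, because no document contains that exact word, even though both documents are about footwear.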
Neural retrieval addresses these shortcomings. Instead of indexing words or phrases directly, a neural encoder translates the query and each document into separate vector representations, assigning pieces of text numerical positions within hundreds of dimensions. The dimensions represent the continuums along which a word, phrase, or passage might be similar to or different from another, including part of speech, case, tense, gender, sentiment, formality, perspective, and a host of other shades of meaning.
In this multidimensional vector space, similar items will cluster together, and unrelated terms will spread apart. With pretrained text encoders, a wealth of semantic and world knowledge can be incorporated into these representations. The retriever then pinpoints relevant documents by determining which passages are closest to the query in the vector space. The relevance function can be learned on a supervised data set of query-document pairs using a bi-encoder architecture (pictured below). This neural retriever architecture is referred to as dense passage retrieval (DPR), after the dense passage retrieval paper it is based on. With its simple, effective design and our efficient and popular open source implementation, DPR was the first work to demonstrate the effectiveness of neural retrieval in the fully supervised setting, and has been widely adopted in research and industry.
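The retrieval step of a bi-encoder can be illustrated without a neural network by standing in hand-written vectors for the encoder outputs. In the sketch below, the three-dimensional embeddings, the passage ids, and the `retrieve` helper are all hypothetical; a real system would produce vectors with hundreds of dimensions from trained query and passage encoders, and would score them with an inner product as shown.

```python
def dot(u, v):
    """Inner product, used here as the relevance score."""
    return sum(a * b for a, b in zip(u, v))

def retrieve(query_vec, passage_vecs, k=2):
    """Return the ids of the k passages whose vectors score highest."""
    scored = sorted(
        ((dot(query_vec, vec), pid) for pid, vec in passage_vecs.items()),
        reverse=True,
    )
    return [pid for _, pid in scored[:k]]

# Hypothetical embeddings standing in for passage-encoder outputs.
passage_vecs = {
    "funk_musician": [0.9, 0.1, 0.0],
    "vice_president": [0.1, 0.9, 0.0],
    "shoe_catalog": [0.0, 0.1, 0.9],
}
# Stand-in for the query encoder's output on a question about
# George Clinton's influence on hip-hop.
query_vec = [0.8, 0.2, 0.1]

top = retrieve(query_vec, passage_vecs, k=2)
```

Because passage vectors do not depend on the query, they can be computed once and indexed ahead of time; only the query needs to be encoded at search time.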
But this basic dense retriever has its own shortcomings. For instance, dense vector representations might fail to capture rare terms that don’t appear in the supervised training data, and the learned model might not generalize well to new domains. The example below shows typical failure cases of a basic dense retrieval system compared with a traditional IR system.
One way to create a more generalizable dense retrieval model is to train it on multiple tasks simultaneously. We built a multitask dense retrieval model by training a neural retriever on eight knowledge-intensive NLP tasks, ranging from fact checking to entity linking to question answering, in the first application of multitasking to neural retrieval. The resulting model shows strong zero-shot generalization: Because it has been pretrained to learn the similarities and differences between data points, it can adapt to a new task with little to no fine-tuning.
Another approach is to augment the training data with synthetic examples: An AI model generates queries based on text in the knowledge source, and we train the network to retrieve the original documents when fed these queries. We have experimented with several methods of creating artificial data. For example, we assembled a pipeline of models to generate 65 million questions from Wikipedia paragraphs, creating a corpus two orders of magnitude larger than a typical fully supervised retrieval data set. We then used the questions to pretrain robust, high-performing dense retriever models.
To teach a dense retriever to recognize rare terms, we generated artificial examples by querying a traditional IR system. Since a traditional retriever can be queried with arbitrary text, we randomly selected sentences from each Wikipedia passage in the corpus to use as queries. Using those examples, we built a salient phrase-aware dense retriever, which combines the advantages of traditional and neural systems in a single architecture.
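One way to picture this data generation step is the sketch below. It is an assumption-laden illustration, not the actual pipeline: the toy `overlap_score` function stands in for a real lexical retriever such as BM25, sentences are split naively on periods, and the choice to treat the top-ranked passage as the positive example is ours.

```python
import random

def overlap_score(query, passage):
    """Toy lexical scorer: count distinct query terms found in the passage."""
    return len(set(query.lower().split()) & set(passage.lower().split()))

def make_examples(passages, rng, n_positive=1):
    """Sample one sentence per passage as a query and label passages
    by how the lexical scorer ranks them."""
    examples = []
    for passage in passages:
        sentences = [s.strip() for s in passage.split(".") if s.strip()]
        query = rng.choice(sentences)
        ranked = sorted(
            passages, key=lambda p: overlap_score(query, p), reverse=True
        )
        examples.append({
            "query": query,
            "positives": ranked[:n_positive],
            "negatives": ranked[n_positive:],
        })
    return examples

# Hypothetical passages for illustration.
passages = [
    "Funk grew out of soul music. George Clinton led Parliament.",
    "The vice president served under Jefferson. He later served under Madison.",
]
examples = make_examples(passages, random.Random(0))
```

A dense retriever trained on pairs like these is pushed to reproduce the rankings of the lexical system, which is one plausible way to teach it to respect exact term matches.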
We can create an unlimited amount of artificial training data by cropping passages from the knowledge source and deleting, masking, or replacing random words. We used this sort of data to train models that, with no supervision, match the zero-shot generalization ability of traditional IR systems.
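A minimal sketch of this kind of self-supervised example generation follows. It covers only cropping and masking (not deletion or replacement), and the function name `make_training_pair`, the crop length, the masking probability, and the `[MASK]` token are all assumptions made for illustration.

```python
import random

def make_training_pair(passage, rng, crop_len=8, mask_prob=0.2,
                       mask_token="[MASK]"):
    """Crop a random span from the passage, mask some of its words,
    and pair the noisy crop (as the query) with the source passage."""
    words = passage.split()
    start = rng.randrange(max(1, len(words) - crop_len + 1))
    crop = words[start:start + crop_len]
    noisy = [mask_token if rng.random() < mask_prob else w for w in crop]
    return " ".join(noisy), passage

# Hypothetical passage for illustration.
passage = ("Dense retrievers encode queries and passages into vectors "
           "and compare them with an inner product at search time")
query, target = make_training_pair(passage, random.Random(0))
```

Each call with a different random state yields a new pair, so the supply of training examples is limited only by the size of the knowledge source.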
Given the default size of the vector representations created by standard pretrained encoders, an index of several million documents (a modest number for a modern knowledge source) can exceed the memory limits of a single server. To address this, we introduced distributed-faiss, which apportions our popular and efficient FAISS vector search library across multiple machines. We have used it with a collection of one billion documents, demonstrating the feasibility of neural retrieval at a large scale.
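A back-of-the-envelope calculation shows why memory becomes the bottleneck. The sketch assumes 768-dimensional vectors stored as 32-bit floats in an uncompressed flat index; those figures and the helper name `index_size_gib` are our assumptions, not numbers stated above.

```python
def index_size_gib(n_docs, dim=768, bytes_per_value=4):
    """Raw size in GiB of an uncompressed index of n_docs vectors."""
    return n_docs * dim * bytes_per_value / 2**30

for n_docs in (1_000_000, 20_000_000, 1_000_000_000):
    print(f"{n_docs:>13,} docs -> {index_size_gib(n_docs):,.1f} GiB")
```

Under these assumptions every million documents costs roughly 2.9 GiB, so an index of tens of millions of passages already needs tens of gigabytes of RAM, and a billion documents would need close to 3 TiB before any compression or sharding.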
Ultimately, two factors determine whether a dense retrieval model can be deployed in real-world applications: the size of the index and the retrieval-time latency. Both factors are directly correlated with vector representation size.
In DrBoost, we tackle this problem by training an ensemble of dense retrievers in stages. We incrementally develop weak retrievers, each with a very compact representation. Component models are trained in sequence; they learn to specialize by focusing only on mistakes made by the current ensemble. The final representation is the concatenation of the output vectors from all the component models. Ensembles trained in this way can match the accuracy of standard dense retrieval models but with vectors one-quarter to one-fifth the size.
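The concatenation step has a useful property that a short sketch can show: the inner product of two concatenated vectors equals the sum of the inner products of their components, so each added component contributes its own term to the ensemble's relevance score. The code below demonstrates only this identity with made-up two-dimensional component outputs; it does not implement the boosting-style training itself, and all values and names are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def concat(vectors):
    """Concatenate component vectors into one ensemble representation."""
    return [x for vec in vectors for x in vec]

# Hypothetical outputs of three weak component encoders, two dims each.
query_parts = [[0.5, 0.1], [0.2, 0.3], [0.0, 0.4]]
passage_parts = [[0.4, 0.2], [0.1, 0.5], [0.3, 0.1]]

# Score computed on the concatenated (ensemble) representation.
ensemble_score = dot(concat(query_parts), concat(passage_parts))
# Same score computed as a sum of per-component scores.
per_component = sum(dot(q, p) for q, p in zip(query_parts, passage_parts))
```

Because the two scores agree, the concatenated vectors can be indexed and searched like those of any single dense retriever, while the total dimensionality is simply the number of components times the size of each one.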
DrBoost also performs superbly under fast approximate search, the setting most relevant to real-world, large-scale applications. DrBoost can retain accuracy while reducing bandwidth and latency requirements by four- to 64-fold. In principle, this allows the approximate index to be served from disk rather than from expensive and limited RAM, making it feasible to deploy dense retrieval systems more cost-effectively and at a much larger scale.