Natural language understanding (NLU) and language translation are key to a range of important applications, including identifying and removing harmful content at scale and connecting people across different languages worldwide. Although deep learning–based methods have accelerated progress in language processing in recent years, current systems are still limited when it comes to tasks for which large volumes of labeled training data are not readily available.
Recently, Facebook AI has achieved impressive breakthroughs in NLP using semi-supervised and self-supervised learning techniques, which leverage unlabeled data to improve performance beyond purely supervised systems. We took first place in several languages in the Fourth Conference on Machine Translation (WMT19) competition using a novel kind of semi-supervised training. We’ve also introduced a new self-supervised pretraining approach, RoBERTa, that surpassed all existing NLU systems on several language comprehension tasks. These systems even outperform human baselines in several cases, including English-German translation and five NLU benchmarks. Across the field, NLU systems have advanced at such a rapid pace that they’ve hit a ceiling on many existing benchmarks. To continue advancing the state of the art, we partnered with New York University (NYU), DeepMind Technologies, and the University of Washington (UW) to develop a brand-new benchmark, leaderboard, and PyTorch toolkit, made up of tasks that we hope will push research further.
Together, these new tools will help us create stronger content understanding systems that can translate hundreds of languages and understand intricacies such as ambiguity, co-reference, and commonsense reasoning, while relying less on the large amounts of labeled training data that most systems require today.
For neural machine translation (NMT) models, supervised training typically requires a large volume of sentences for which we have reference translations. Large amounts of high-quality bilingual data are not widely available, however, encouraging researchers to turn to monolingual data, for which no translations exist. Back translation, a semi-supervised learning technique, allows us to partially overcome this issue. Our most recent submission to WMT builds on our earlier work on large-scale sampled back translation, which helped us win first place in the same competition last year.
This year, we introduced a new method to further improve our back-translation system by generating many candidate translations and choosing the one that best balances three different model scores: forward, backward, and fluency. The forward score measures how well the candidate translation captures the meaning of the original sentence. The backward score instead measures how well the original sentence could be reconstructed from the candidate translation. The last score measures the fluency of the candidate translation and comes from a language model trained in a self-supervised way on large quantities of monolingual data. By balancing these three scores, we are able to produce significantly improved translations.
As a result, year over year, we’ve improved performance on the English to German translation task by 4.5 BLEU (a metric that measures the degree of overlap between the generated translation and a professional reference translation), a large improvement. According to human evaluations, our models ranked first in four translation tasks: English to German, German to English, English to Russian, and Russian to English. Our English to German translations even outperformed human translators, as assessed by WMT's human judges.
The image above shows how this technique works: First, a forward model translates a sentence, such as from German to English, generating a set of English translations, or hypotheses. A backward model then translates those English hypotheses back into German, allowing the system to evaluate how well each English translation appears to line up with the original German sentence. Finally, a language model judges the fluency of the English translations.
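The reranking step described above can be sketched in a few lines. This is a minimal illustration, not the actual WMT system: `rerank` is a hypothetical helper, and the three scoring functions stand in for the forward model, the backward model, and the language model, each assumed to return a log-probability.

```python
def rerank(candidates, forward_score, backward_score, lm_score,
           w_fwd=1.0, w_bwd=1.0, w_lm=1.0):
    """Choose the candidate translation that best balances the three
    model scores: forward (meaning preserved), backward (source
    reconstructable), and fluency (language model). All scorers are
    assumed to return log-probabilities; weights are tunable."""
    def combined(c):
        return (w_fwd * forward_score(c)
                + w_bwd * backward_score(c)
                + w_lm * lm_score(c))
    return max(candidates, key=combined)

# Toy log-prob scores for three hypothetical candidates.
scores = {
    "a": (-2.0, -3.0, -4.0),   # combined: -9.0
    "b": (-2.1, -1.5, -1.0),   # combined: -4.6 (best)
    "c": (-1.9, -4.0, -5.0),   # combined: -10.9
}
best = rerank(scores, lambda c: scores[c][0],
              lambda c: scores[c][1], lambda c: scores[c][2])
print(best)  # "b"
```

In practice the weights on the three scores would be tuned on a validation set rather than fixed at 1.0.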
We also scaled training to much larger datasets, incorporating roughly 10 billion words for the English to German setting. We used more than twice the amount of monolingual data for semi-supervised training compared with last year, further improving translation accuracy. For more details, read Facebook AI leads in 2019 WMT international machine translation competition.
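The BLEU metric mentioned above is straightforward to sketch: it combines clipped n-gram precision with a brevity penalty that discourages overly short translations. The function below is a simplified sentence-level version for illustration only; real BLEU is computed at the corpus level, usually with smoothing.

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU: geometric mean of clipped
    n-gram precisions (n = 1..max_n) times a brevity penalty.
    A tiny floor avoids log(0) when an n-gram order has no overlap."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(cand[i:i + n])
                              for i in range(len(cand) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n])
                             for i in range(len(ref) - n + 1))
        # Counter intersection clips each n-gram count by the reference.
        overlap = sum((cand_ngrams & ref_ngrams).values())
        total = max(sum(cand_ngrams.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)
    bp = min(1.0, math.exp(1 - len(ref) / len(cand)))  # brevity penalty
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

print(round(bleu("the cat sat on the mat",
                 "the cat sat on the mat"), 2))  # 1.0
```

A perfect match scores 1.0 (reported as 100 in WMT-style results); a candidate sharing no n-grams with the reference scores near zero.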
We recently optimized and improved upon one of the biggest breakthroughs in natural language processing (NLP), which Google released in 2018: the Bidirectional Encoder Representations from Transformers, or BERT. BERT is revolutionary because it demonstrates the potential for self-supervised training techniques to match or exceed the performance of traditional, label-intensive supervised approaches. For instance, we leveraged BERT and related approaches to push cutting-edge research in conversational AI, improve content understanding systems, and improve low-resource and unsupervised translation quality.
Because Google open-sourced BERT, we were able to conduct a replication study and identify design changes that further improve its effectiveness. We introduced the Robustly Optimized BERT Pretraining Approach, or RoBERTa, which achieves new state-of-the-art results.
RoBERTa modifies key hyperparameters in BERT, including removing BERT’s next-sentence pretraining objective and training with much larger minibatches and learning rates. We also trained for much longer, on more than 10x more data overall, compared with BERT. This approach led to new state-of-the-art results on two widely used NLP benchmarks: General Language Understanding Evaluation (GLUE) and ReAding Comprehension from Examinations (RACE).
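The masked language modeling objective at the core of this pretraining can be sketched as follows. This is a toy illustration assuming whole-word tokens and a tiny stand-in vocabulary; `mask_tokens` and `VOCAB` are hypothetical names, and a real implementation operates on subword IDs in large batches. (RoBERTa also re-samples the mask on every pass over the data, rather than fixing it once as original BERT preprocessing did.)

```python
import random

MASK = "[MASK]"
VOCAB = ["the", "cat", "sat", "on", "mat", "dog"]  # toy vocabulary

def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """BERT-style masking: select ~15% of positions as prediction
    targets; of those, 80% become [MASK], 10% become a random token,
    and 10% are left unchanged. Returns (corrupted, labels), where
    labels hold the original token at masked positions, else None."""
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must predict this token
            r = rng.random()
            if r < 0.8:
                corrupted.append(MASK)
            elif r < 0.9:
                corrupted.append(rng.choice(VOCAB))
            else:
                corrupted.append(tok)
        else:
            labels.append(None)
            corrupted.append(tok)
    return corrupted, labels
```

Calling `mask_tokens` with a fresh seed on each epoch yields a different corruption of the same sentence, which is the essence of dynamic masking.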
With an average score of 88.5, RoBERTa earned the top position on the GLUE leaderboard, matching the performance of the previous leader, XLNet-Large, with an average score of 88.4. RoBERTa also advanced the state of the art on several language understanding benchmarks, including MNLI, QNLI, RTE, STS-B, and RACE tasks.
This achievement is part of our ongoing commitment to advancing the performance and potential of self-supervised systems that are less reliant on data labeling. For more details on RoBERTa, read RoBERTa: An optimized method for pretraining self-supervised NLP systems.
As an industry standard for measuring research progress, GLUE was meant to cover a wide swath of NLP tasks, so that the only way to perform well was to build tools general enough to help solve most new language understanding problems.
Within one year of release, several NLP models (including RoBERTa) have already surpassed human baseline performance on the GLUE benchmark. Current models have converged on a surprisingly effective recipe that combines language model pretraining on huge text datasets with simple multitask and transfer learning techniques.
This rapid pace of advancement is a function of the collaboration within the larger AI community. The NLP competitions, benchmarks, and code releases described above enable model replication, improvements, and faster advances in state-of-the-art results. Model performance on GLUE jumped sharply with the introduction of GPT and BERT, and recent models have now crossed human performance, as shown in this figure:
Although current models can surpass human-level performance on specific GLUE tasks, they are not yet able to solve some of the tasks humans solve perfectly. To set a new, higher bar for NLP research, Facebook AI partnered with NYU, DeepMind, and UW to construct SuperGLUE, a much harder benchmark with comprehensive human baselines. We are launching SuperGLUE to allow language understanding researchers to continue advancing the state of the art.
Both the original and new benchmarks were created in collaboration with the same partners, with NYU leading the effort. SuperGLUE follows in the footsteps of GLUE, which offers a single-number metric that summarizes progress on a diverse set of NLP tasks. In addition to the new benchmark, we are releasing a leaderboard and a PyTorch toolkit for bootstrapping research.
SuperGLUE comprises new ways to test creative approaches on a range of difficult NLP tasks focused on innovations in a number of core areas of machine learning, including sample-efficient, transfer, multitask, and self-supervised learning. To challenge researchers, we selected tasks that have varied formats, have more nuanced questions, have yet to be solved using state-of-the-art methods, and are easily solvable by people. To confirm that candidate tasks met these criteria, we ran BERT-based baselines on many of them and collected data for human baselines.
The new benchmark includes eight diverse and challenging tasks, including Choice of Plausible Alternatives (COPA), a causal reasoning task, in which a system is given a premise sentence and must determine either the cause or effect of the premise from two possible choices. Notably, humans have obtained 100 percent accuracy on COPA while BERT achieves just 74 percent, which demonstrates a large opportunity for progress.
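A baseline for a task like COPA can be framed as scoring each (premise, choice) pair and picking the higher-scoring alternative. The sketch below uses a stand-in scoring function in place of a real model; `answer_copa` and the hard-coded scores are hypothetical, and an actual baseline would use something like a fine-tuned BERT classifier head to produce the scores.

```python
def answer_copa(premise, question, choice1, choice2, score):
    """Answer a COPA example by picking the more plausible
    alternative. `score(premise, question, choice)` stands in for
    a trained model returning a plausibility score for the pair."""
    s1 = score(premise, question, choice1)
    s2 = score(premise, question, choice2)
    return choice1 if s1 >= s2 else choice2

# Toy example; the fixed scores below stand in for model outputs.
example = {
    "premise": "The man broke his toe.",
    "question": "cause",  # asks for the cause of the premise
    "choice1": "He got a hole in his sock.",
    "choice2": "He dropped a hammer on his foot.",
}
model_scores = {example["choice1"]: 0.2, example["choice2"]: 0.9}
score = lambda p, q, c: model_scores[c]
answer = answer_copa(example["premise"], example["question"],
                     example["choice1"], example["choice2"], score)
print(answer)  # "He dropped a hammer on his foot."
```

Because the task is a forced choice between two alternatives, chance accuracy is 50 percent, which makes BERT's 74 percent look modest next to the 100 percent human ceiling.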
Other unique, leading-edge components include a diagnostic tool to measure biases in these models. Specifically, we include Winogender, which is designed to test for the presence of gender bias in automated co-reference resolution systems. SuperGLUE also includes a question answering (QA) task called BoolQ, in which each example consists of a short passage and a yes or no question about the passage; it serves as a good proxy for the Natural Questions benchmark.
Similar to GLUE, the new benchmark also consists of a public leaderboard built around language understanding tasks, drawing on existing data, accompanied by a single-number performance metric and an analysis toolkit.
We recently tested RoBERTa against the new benchmark, and RoBERTa outperformed all existing NLU systems, even surpassing the human baseline on the Multisentence Reading Comprehension (MultiRC) task. Still, there remains a large gap between RoBERTa and the human baselines on many of the SuperGLUE tasks, illustrating some of the limitations of today’s state-of-the-art NLU systems.
To further challenge what AI systems can help us with, we also introduced the first long-form question answering dataset and benchmark, which requires machines to provide long, complex answers — something that existing algorithms have not been challenged to do before. Current question answering systems are focused on trivia-type questions, such as whether jellyfish have a brain. This new challenge goes further by requiring machines to elaborate with in-depth answers to open-ended questions, such as “How do jellyfish function without a brain?” Existing algorithms are far from human performance, and this new challenge will push AI to synthesize information from different sources to provide complex responses to open-ended questions.
All the work described here has been part of a larger movement that is rapidly advancing the state of the art in language processing. By releasing new standards for measuring progress, introducing new methods for semi-supervised and self-supervised learning, and training over ever-larger scales of data, we hope to inspire the next generation of innovation. By challenging one another to go further, the NLP research community will continue to build stronger language processing systems.