July 03, 2020
Benchmarks play a crucial role in driving progress in AI research. They give the research community a common goal and allow direct comparisons between different architectures, ideas, and methods. But as research advances, static benchmarks have proved limited: they saturate quickly, particularly in natural language processing (NLP). For instance, less than a year after the GLUE benchmark was introduced in early 2018, models reached human-level performance on it. SuperGLUE added a new set of more difficult tasks, but it was also soon saturated as researchers built models that achieved “superhuman” performance on the benchmark.
Such static benchmarks can encourage models that not only overfit to the benchmark but also pick up on inadvertent biases in the data, rather than truly understanding language. Famously, simply answering “2” to quantitative “How much?” questions in some QA datasets yields unexpectedly high accuracy. So while progress in NLP has been rapid, AI systems are still far from truly understanding natural language. This raises two questions: Are our benchmarks measuring the right thing? And can we make benchmarks more robust and longer-lasting?
To provide a stronger NLP benchmark, we’re introducing a new large-scale dataset called Adversarial Natural Language Inference (ANLI). NLI is a core NLP task and a good proxy for judging how well AI systems understand language. The goal is to determine whether a statement (the hypothesis) is entailed by, contradicts, or is neutral with respect to a given context (the premise). For example, given the context “Socrates is a man, and men are mortal,” the statement “Socrates is mortal” is entailed, “Socrates is immortal” is a contradiction, and “Socrates is a philosopher” is neutral.
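The task format can be made concrete with a small sketch in plain Python (the class and variable names here are illustrative, not part of the released dataset), showing how an NLI dataset represents premise–hypothesis pairs and their three labels:

```python
from dataclasses import dataclass

# The three NLI labels.
LABELS = ("entailment", "contradiction", "neutral")

@dataclass
class NLIExample:
    premise: str     # the context the statement is judged against
    hypothesis: str  # the statement to classify
    label: str       # gold label, one of LABELS

# The Socrates examples from the text, written the way an NLI model sees them.
PREMISE = "Socrates is a man, and men are mortal."
EXAMPLES = [
    NLIExample(PREMISE, "Socrates is mortal.", "entailment"),
    NLIExample(PREMISE, "Socrates is immortal.", "contradiction"),
    NLIExample(PREMISE, "Socrates is a philosopher.", "neutral"),
]
```

An NLI model is simply a three-way classifier over such (premise, hypothesis) pairs.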
We took a novel, dynamic approach to building the ANLI dataset, in which human annotators purposely fool state-of-the-art models on the NLI task, creating valuable new examples for training stronger models. By repeating this over several rounds, we iteratively push state-of-the-art models to improve on their weaknesses and create increasingly hard test sets. If a model overfits or learns a bias, we can add a new round of examples to challenge it. As a result, this dynamic, iterative approach makes the task effectively impossible to saturate and represents a new, robust challenge for the NLP community.
Our novel approach to data collection is called HAMLET (Human-And-Model-in-the-Loop Enabled Training). We employed human annotators to write statements that purposely try to make a state-of-the-art model predict the wrong label for a given context (or premise), with contexts randomly sampled from publicly available third-party datasets. Annotators who succeeded in fooling the model received a larger reward, incentivizing them to come up with hard examples that are valuable for training more robust models. For each human-written example the model misclassified, we also asked the annotator to give a reason they believe the model failed, which another person then verified.
We repeated the procedure over three rounds, collecting examples against a different model in each round, with each model stronger than the last because it is trained on the newly collected data. We show that this process leads annotators to create more difficult examples, which are consequently more valuable for training. The collected examples pose a dynamic challenge for current state-of-the-art systems, which perform poorly on the new dataset.
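The collection loop described above can be sketched as a toy simulation. This is a hedged illustration of the idea, not the actual annotation pipeline: `toy_model` stands in for a trained state-of-the-art model, and `annotator` stands in for a human writer who tries to fool it.

```python
from dataclasses import dataclass

LABELS = ("entailment", "contradiction", "neutral")

@dataclass
class Example:
    premise: str
    hypothesis: str
    gold: str  # label the annotator intends

def toy_model(train):
    """Stand-in for a state-of-the-art model: it predicts the most common
    gold label in its training data, so it is easy to fool."""
    counts = {label: sum(ex.gold == label for ex in train) for label in LABELS}
    majority = max(counts, key=counts.get)
    return lambda ex: majority

def hamlet_round(train, write_candidate, attempts):
    """One round: the annotator writes candidate examples; those that fool
    the current model (wrong prediction) are kept and added to training."""
    model = toy_model(train)
    fooled = [ex for ex in (write_candidate(i) for i in range(attempts))
              if model(ex) != ex.gold]
    return train + fooled, fooled

# Seed data skewed toward "entailment", so the toy model always predicts it.
seed = [Example("p", f"h{i}", "entailment") for i in range(5)]

# An "annotator" who writes contradiction examples, which fool this model.
annotator = lambda i: Example("p", f"tricky{i}", "contradiction")

train, collected = hamlet_round(seed, annotator, attempts=3)
```

Calling `hamlet_round` again on the updated `train` set, against a model retrained on the fooling examples, simulates the successive rounds described above: each round's model is harder to fool, so annotators must invent harder examples.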
Current static benchmarks are struggling to keep up with progress in NLP. With our new HAMLET approach and ANLI dataset, where both models and humans are in the loop interactively, we can push state-of-the-art models toward meaningful improvements in language understanding.
Dynamic adversarial data collection gives us a better measure of the strength of our models: the harder an NLU system is to fool, the stronger its grasp of language. Looking forward, we believe benchmarks should not be static targets. Instead, the research community should move toward dynamic benchmarking, with current state-of-the-art models in the loop.