September 24, 2021
It’s been one year since Facebook AI launched Dynabench, a first-of-its-kind platform that radically rethinks benchmarking in AI. Starting today, we’re unlocking Dynabench’s full capabilities for the AI community: AI researchers can now create their own custom tasks, free of charge, to better evaluate the performance of natural language processing (NLP) models in more flexible, dynamic, and realistic settings.
This new feature, called Dynatask, makes it easy for researchers to leverage human annotators to actively fool NLP models and identify weaknesses through natural interactions. This dynamic approach arguably better reflects the way people behave and react as compared with previous benchmarks, which test against fixed data points and are prone to saturation. Researchers can also use our evaluation-as-a-service capabilities and compare models on our dynamic leaderboard, which goes beyond just accuracy and explores a more holistic measurement of fairness, robustness, compute, and memory.
Dynabench initially launched with four tasks: natural language inference (created by Yixin Nie and Mohit Bansal of UNC Chapel Hill), question answering (created by Max Bartolo, Pontus Stenetorp, and Sebastian Riedel of UCL), sentiment analysis (created by Atticus Geiger and Chris Potts of Stanford), and hate speech detection (created by Bertie Vidgen of the Turing Institute and Zeerak Waseem Talat of the University of Sheffield/Simon Fraser University).
Over the past year, we’ve launched a visual question answering task and low-resource machine translation tasks. We also powered the multilingual translation challenge at the Workshop on Machine Translation (WMT). Cumulatively, these dynamic data collection efforts have so far resulted in eight published papers, 400K raw examples, and four open source large-scale data sets.
“Dynatask opens up a world of possibilities for task creators. They can set up their own tasks with little coding experience, easily customize annotation interfaces, and enable interactions with models hosted on Dynabench. This makes dynamic adversarial data collection considerably more accessible to the research community,” said Max Bartolo of University College London.
Now, we hope that by enabling custom NLP tasks for the entire AI community, we’ll empower the field to explore entirely new research directions. High-quality and holistic model evaluation is critical to the long-term success of AI, and we believe that Dynabench, as a collaborative effort, will play an important role in the future of benchmarking.
Dynatask is highly flexible and customizable. A single task can have one or more owners, who define the settings of that task. For example, owners can choose which existing data sets they want to use in the evaluation-as-a-service framework. They can select from a wide variety of evaluation metrics to measure model performance, including not only accuracy but also robustness, fairness, compute, and memory. Anyone can upload models to a task’s evaluation cloud, where scores and other metrics are computed on the selected data sets. Once those models have been uploaded and evaluated, they can be placed in the loop for dynamic data collection and human-in-the-loop evaluation. Task owners can also collect data via the web interface on dynabench.org or with crowdsourced annotators (such as those on Mechanical Turk).
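To make the multi-metric idea concrete, here is a minimal, hypothetical sketch of how per-metric scores might be combined into a single leaderboard ordering. The metric names, weights, and model scores below are illustrative assumptions of ours, not Dynabench’s actual API or scoring pipeline, which task owners configure on the platform itself.

```python
# Hypothetical sketch: combining several evaluation metrics into one
# leaderboard ordering. All names and numbers here are illustrative;
# they are not Dynabench's real scoring code.

def combined_score(scores, weights):
    """Weighted average of metric scores, each assumed to lie in [0, 1]."""
    total = sum(weights.values())
    return sum(scores[metric] * w for metric, w in weights.items()) / total

# A task owner might weight accuracy alongside robustness and fairness.
weights = {"accuracy": 0.4, "robustness": 0.3, "fairness": 0.3}

models = {
    "model_a": {"accuracy": 0.92, "robustness": 0.70, "fairness": 0.80},
    "model_b": {"accuracy": 0.88, "robustness": 0.85, "fairness": 0.86},
}

# model_b ranks first despite lower accuracy, because it is stronger
# on the robustness and fairness metrics.
leaderboard = sorted(models, key=lambda m: combined_score(models[m], weights),
                     reverse=True)
print(leaderboard)  # ['model_b', 'model_a']
```

The point of the sketch is simply that a model with the best raw accuracy need not top a leaderboard once other dimensions of quality are weighted in.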
Let’s walk through a concrete example that illustrates the different components. Suppose there were no natural language inference task yet, and you wanted to start one.
Step 1: Log into your Dynabench account and fill out the “Request new task” form on your profile page.
Step 2: Once approved, you will have a dedicated task page and a corresponding admin dashboard that you, as the task owner, control.
Step 3: On the dashboard, choose the existing datasets that you want to evaluate models on when they are uploaded, along with the metrics you want to use for evaluation.
Step 4: Next, submit baseline models, or ask the community to submit them.
Step 5: If you then want to collect a new round of dynamic adversarial data, where annotators are asked to create examples that fool the model, you can upload new contexts to the system and start collecting data through the task owner interface.
Step 6: Once you have enough data and find that training on the data helps improve the system, you can upload better models and then put those in the data collection loop to build even stronger ones.
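The adversarial loop in Steps 4 through 6 can be sketched in a few lines of self-contained code. The stub model and helper function below are our own illustrative assumptions, not Dynabench internals: a real task would call a hosted model, and annotators would write examples interactively through the web interface.

```python
# Illustrative sketch of model-in-the-loop adversarial data collection.
# `toy_model` stands in for a real NLI model hosted on the platform.

def toy_model(premise, hypothesis):
    """Stub classifier: predicts 'entailment' iff the hypothesis is a
    substring of the premise -- a deliberately weak heuristic that
    annotators can fool."""
    return "entailment" if hypothesis in premise else "contradiction"

def collect_fooling_examples(annotations):
    """Keep only examples where the model's prediction disagrees with
    the annotator's gold label, i.e., where the model was fooled."""
    fooled = []
    for premise, hypothesis, gold in annotations:
        if toy_model(premise, hypothesis) != gold:
            fooled.append((premise, hypothesis, gold))
    return fooled

annotations = [
    # The model handles a literal substring, so this one is discarded.
    ("A dog runs in the park", "A dog runs", "entailment"),
    # A paraphrase fools the stub model, so this example is kept.
    ("A dog runs in the park", "An animal is outdoors", "entailment"),
]
print(collect_fooling_examples(annotations))
```

The examples that survive this filter are exactly the ones worth keeping: training on them targets the current model’s weaknesses, and a stronger model can then be placed back in the loop for the next round.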
We used the same basic process to construct several dynamic data sets, like Adversarial Natural Language Inference. Now, with our tools available for the broader AI community, anyone can construct data sets with humans and models in the loop.
Dynabench is centered on community. We want to empower the AI community to explore better, more holistic, and more reproducible approaches to model evaluation. Our goal is to make it easy for anyone to construct high-quality human-and-model-in-the-loop data sets. With Dynatask, you can move beyond accuracy-only leaderboards toward a more holistic evaluation of AI models, one more closely aligned with the expectations and needs of the people who interact with them.
At Facebook AI, we believe in collaborative open science, scientific rigor, and responsible innovation. Of course, the platform will continue to evolve and change as the community grows, befitting its dynamic nature. We invite you to join our Dynabench community.
Create new examples that fool existing models, upload new models for evaluation, or request your own Dynabench task now.