June 4, 2021
Machine translation helps bridge the language barriers between people and information — but historically, research has focused on creating and evaluating translation systems for only a handful of languages, usually the few most spoken languages in the world. This excludes the billions of people worldwide who don’t happen to be fluent in languages such as English, Spanish, Russian, and Mandarin.
We’ve recently made progress with machine translation systems like M2M-100, our open source model that can translate a hundred different languages. Further advances necessitate tools with which to test and compare these translation systems with one another, though.
Today, we are open-sourcing FLORES-101, a first-of-its-kind, many-to-many evaluation data set covering 101 languages from all over the world. FLORES-101 is the missing piece, the tool that enables researchers to rapidly test and improve upon multilingual translation models like M2M-100.
We’re making FLORES-101 publicly available because we believe in breaking down language barriers, and that means helping empower researchers to create more diverse (and locally relevant) translation tools — ones that may make it as easy to translate from, say, Bengali to Marathi as it is to translate from English to Spanish today. We’re making the full FLORES-101 data set, an accompanying tech report, and several models publicly available for the entire research community to use, to accelerate progress on many-to-many translation systems worldwide.
Imagine trying to bake a cake — but not being able to taste it. It’s near-impossible to know whether it’s any good, and even harder to know how to improve the recipe for future attempts.
Evaluating how well translation systems perform has been a major challenge for AI researchers — and that knowledge gap has impeded progress. If researchers cannot measure or compare their results, they can’t develop better translation systems. The AI research community needed an open and easily accessible way to perform high-quality, reliable measurement of many-to-many translation model performance and then compare results with others.
Previous work on this problem relied heavily on translating in and out of English, often using proprietary data sets. But while this benefited English speakers, it was and is insufficient for many parts of the world where people need fast and accurate translation between regional languages — for instance, in India, where the constitution recognizes over 20 official languages.
FLORES-101 focuses on what are known as low-resource languages, such as Amharic, Mongolian, and Urdu, which do not currently have extensive data sets for natural language processing research. For the first time, researchers will be able to reliably measure the quality of translations through 10,100 different translation directions — for example, directly from Hindi to Thai or Swahili. For context, evaluating in and out of English would provide merely 200 translation directions.
The flexibility exhibited by FLORES is possible because we designed around many-to-many translation from the start. The data set contains the same set of sentences across all languages, enabling researchers to evaluate the performance of any and all translation directions.
“Efforts like FLORES are of immense value, because they not only draw attention to under-served languages, but they immediately invite and actively facilitate research on all these languages,” said Antonios Anastasopoulos, assistant professor at George Mason University’s Department of Computer Science.
Good benchmarks are difficult to construct. They need to be able to accurately reflect meaningful differences between models so they can be used by researchers to make decisions. Translation benchmarks can be particularly difficult because the same quality standard must be met across all languages, not just a select few for which translators are more readily available.
To that end, we created the FLORES-101 data set in a multistep workflow. Each document was first translated by a professional translator, and then verified by a human editor. Next, it proceeded to the quality-control phase, including checks for spelling, grammar, punctuation, and formatting, and comparison with translations from commercial engines. After that, a different set of translators performed human evaluation, identifying errors across numerous categories including unnatural translation, register, and grammar. Based on the number and severity of the identified errors, the translations were either sent back for retranslation or — if they met quality standards — the translations were considered complete.
Translation quality is not enough on its own, though. There were a number of other considerations we worked through when designing FLORES as a useful resource for the AI community:
Focus on low-resource languages: Unlike most existing benchmarks, over 80 percent of the languages used in FLORES are currently low-resource, with little to no data available to train models that translate these languages.
Many-to-many: As FLORES translates the same set of sentences across all languages, it can be used to evaluate the performance of any of 10,100 different translation directions.
Multidomain: Many translation benchmarks are built to focus on one domain, typically news. But in real life, people want a variety of content translated. FLORES pulls content from several different domains, including news, travel guides, and books on a variety of different topics.
Document-level: Recent trends in machine translation indicate the need for document-level translation, or translation that goes beyond individual sentences and takes the broader context into account. FLORES is constructed to translate multiple adjacent sentences from selected documents, meaning models can measure whether document-level context improves translation quality.
Metadata: Translation benchmarks often have no associated supporting information, making meta-level analysis or incorporation of data such as images and hyperlinks impossible. FLORES provides full metadata along with each translation, including information such as hyperlinks, URLs, images, and the article topic.
Server-side evaluation: We’ve partnered with the Dynabench platform to freely host evaluations for the FLORES benchmark. Evaluation on a server helps ensure that models are being measured in exactly the same way on a hidden test set, which enables scientific comparisons. Further, a wide variety of metrics can be computed on the same set of translations, which means that translation quality can be evaluated holistically from a variety of different metrics.
We’ve created the FLORES data set to enable research into translation for 101 languages, and we hope to advance this cause as a research community. To this end, we collaborated with the Workshop on Machine Translation (WMT) to host the Large-Scale Multilingual Translation shared task. Participants are invited to submit to three possible tracks: Central/Eastern European languages, East Asian languages, or the challenge of 101 x 101 translation. The evaluation will be based on the FLORES data set.
We’ve focused on low-resource languages, providing reliable evaluation data for researchers all over the world to develop translation systems in their own languages. However, we know that lack of compute remains a major barrier for many. We want to enable a broader community of researchers to work on multilingual translation, as working together and in the open accelerates scientific research. As part of the Large-Scale Multilingual Task at the WMT, we partnered with Microsoft Azure to offer compute grants to researchers working on low-resource languages. These grants provide thousands of GPU hours on the Azure platform for researchers to develop translation models.
Our grant recipients come from universities all over the world, with a majority focusing on developing translation systems for the languages they speak themselves as well as for other regional languages. Several are focusing on African and Southeast Asian languages — areas of the world where a large number of different languages are spoken and where improvements in machine translation could have a great impact on communities.
For billions of people, especially non-English speakers, language remains a fundamental barrier to accessing information and communicating freely with other people. While there have been major advances in machine translation over the past few years, both at Facebook AI Research (FAIR) and elsewhere, a handful of languages have benefited most from these efforts. If the aim is to break down these language barriers and bring people closer together, then we must broaden our horizons.
“I think [FLORES] is a really exciting resource to help improve the representation of many languages within the machine translation community,” said Graham Neubig, a professor at the Carnegie Mellon University Language Technology Institute in the School of Computer Science. “It is certainly one of the most extensive resources that I know of that covers so many languages from all over the world, in a domain of such relevance to information access as Wikipedia text.”
The release of FLORES-101 enables — for the first time — accurate evaluation of many-to-many models. But that’s just the start, and by making the FLORES-101 data set available, we hope researchers will be able to accelerate work on multilingual translation models like M2M-100 and develop translation models in more languages, particularly in cases that do not necessarily involve English. Together, we believe the community can make rapid progress on low-resource machine translation that will benefit people around the world.
We’d like to thank Vishrav Chaudhary, Peng-Jen Chen, Guillaume Wenzek, Da Ju, Marc'Aurelio Ranzato, and Francisco Guzman for their work on this project.
Technical Program Manager