April 2, 2021
Recent progress in machine translation (MT) has broken down longstanding language barriers, enabling people to communicate with others around the world, travel with greater confidence, and read the news in another country, or just read the menus at the local cafe. Unfortunately, these benefits are currently limited to languages for which enough data exists to build modern translation systems. The vast majority of the world's languages, even ones with millions of speakers, lack sufficient data. These are commonly referred to as low-resource languages, and we’ve previously released the FLORES data sets and open-source models to enable further research into them.
To spur progress on the challenge of low-resource translation, we are launching the Large-Scale Multilingual MT Track at the Conference on Machine Translation (WMT), the premier academic translation competition. Researchers are invited to submit their strongest multilingual translation models to compete in three tracks: a small track focused on five low-resource Eastern European languages, a small track focused on five low-resource Southeast Asian languages, and a large track covering more than 100 languages.
Training data comes from the publicly available OPUS repository, supplemented by monolingual Wikipedia data for each language. To keep models and methods directly comparable, only the provided data may be used in the small tracks; the large track is unconstrained. The evaluation server will open June 4, and the final evaluation will take place from August 9 to 13, 2021. For more details on the training, validation, and test data, and on the key competition dates, please visit the competition site.
While progress in translation is driven by the entire research community, we know that many interested researchers lack access to sufficient computing resources. To enable participation from those who might not otherwise be able to take part, we will provide grants for cloud compute. We’re launching a request for proposals and encourage all interested researchers to apply. Applicants will be asked to submit a short statement on how they plan to use the GPU compute to advance research in low-resource machine translation with the FLORES data set in WMT 2021.
For billions of people globally, language is a barrier to accessing information and gaining valuable new experiences. Unless you speak one of the handful of languages that dominate the web, many aspects of the modern world are difficult if not impossible, from searching the internet for information to navigating an unfamiliar country. With FLORES, the Large-Scale Multilingual MT Track at WMT, and the associated compute grants, we hope that the machine translation community will be able to make more rapid progress on low-resource translation and help create a more connected and communicative world.
Visit our competition site for more information.