Machine translation (MT) is one of the most successful applications of natural language processing (NLP) today, with systems surpassing human-level performance in some translation tasks. These advances, however, rely on the availability of large-scale parallel corpora: collections of sentences in the source language paired with their translations in the target language. Current high-performing MT systems need these large parallel datasets to fit the hundreds of millions of model parameters currently required to make accurate predictions. The vast majority of languages today, though, don’t have such resources, and we need to build systems that can work effectively for everyone.
In our pursuit of building more powerful, flexible MT systems, Facebook AI has recently achieved two major milestones for low-resource languages.
We’ve developed a novel approach that combines several methods, including iterative back-translation and self-training with noisy channel decoding, to build the best-performing English-Burmese MT system at the Workshop on Asian Translation (WAT) competition, with a gain of 8+ BLEU points over the second-best team.
We’ve also developed a state-of-the-art approach for better filtering of noisy parallel data from public websites with our LASER (Language-Agnostic SEntence Representations) toolkit. Using this technique, we took first place in the shared task on corpus filtering for the low-resource languages Sinhala and Nepali at the Fourth Conference on Machine Translation (known as WMT).
These research advances were important stepping stones toward supporting translation for several new low-resource languages across Facebook’s family of apps, most recently Lao, Kazakh, Haitian, Oromo, and Burmese. We power nearly 6 billion translations per day, and these complementary approaches help us provide a better experience across those apps.
In the absence of the necessary parallel dataset typically required to train MT systems, we leverage monolingual data, or text without any translation, as it’s often more abundant than parallel data.
Back-translation, the current workhorse for leveraging monolingual data in MT, carries two assumptions. One, there must be a significant volume of monolingual data in the target language. Two, the subject matter of sentences in both source (e.g., English) and target languages (e.g., Burmese) should be comparable. These assumptions are not always met in low-resource settings because low-resource languages (e.g., Burmese) often have less monolingual data available than what’s required for back-translation. It’s also common for topic distribution to vary widely because of differences in the local context and cultures, as we explored in this recent paper.
Our new state-of-the-art approach for English-Burmese MT addresses these challenges by combining iterative back-translation, which helped us win the best paper award at EMNLP 2018, with self-training. When translating from English to Burmese, self-training enables us to leverage English monolingual data on the source side, as opposed to Burmese monolingual data on the target side. This method is useful because English monolingual data originates in English-speaking countries and has a topic distribution similar to that of the sentences we eventually want to translate. We’re sharing a step-by-step guide to our approach, which uses both back-translation and self-training.
First, using the limited parallel data available, we train a reverse MT system that translates from the target to the source language. We then use this system to translate monolingual data on the target side, which produces a synthetic dataset composed of back-translations from the reverse MT system and the corresponding target sentences.
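The back-translation step described above can be sketched in a few lines. This is a minimal illustration, not the production pipeline: `back_translate` and `toy_reverse_model` are hypothetical names, and the toy function merely stands in for a trained target-to-source MT system.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def back_translate(
    reverse_translate: Callable[[str], str],
    target_monolingual: List[str],
) -> List[Pair]:
    """Pair each real target sentence with a machine-generated source sentence."""
    return [(reverse_translate(tgt), tgt) for tgt in target_monolingual]


def toy_reverse_model(sentence: str) -> str:
    # Hypothetical stand-in for a trained target-to-source model.
    return "<src> " + sentence


target_sentences = ["target sentence one", "target sentence two"]
synthetic_pairs = back_translate(toy_reverse_model, target_sentences)
```

The key property is that the target side of every synthetic pair is a genuine sentence; only the source side is machine-generated.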
Our approach to back-translation leverages our previous work on improving beam search through noisy channel reranking, which was key to winning the WMT 2019 competition for four language directions, and we use it for not only the forward system but also the reverse system to improve the quality of our back-translated data. Using the standard back-translation approach, we’d typically augment the small, parallel dataset with the synthetic dataset and train our desired forward translation system. In our approach, however, we also add forward translations in the second step as detailed next.
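To make the reranking idea concrete, here is a small sketch of one common formulation of noisy channel reranking, in which each beam candidate is scored by combining the forward model, the reverse (channel) model, and a target-side language model. The weights, the absence of length normalization, and the log-probability values are all assumptions made for illustration; they are not taken from the system described here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    text: str
    direct_logprob: float   # log P(target | source), forward model
    channel_logprob: float  # log P(source | target), reverse model
    lm_logprob: float       # log P(target), target-side language model


def noisy_channel_score(
    cand: Candidate, channel_weight: float = 1.0, lm_weight: float = 1.0
) -> float:
    """Weighted sum of the three log-probabilities (weights are assumed values)."""
    return (
        cand.direct_logprob
        + channel_weight * cand.channel_logprob
        + lm_weight * cand.lm_logprob
    )


def rerank(candidates: List[Candidate]) -> List[Candidate]:
    """Sort beam candidates from best to worst combined score."""
    return sorted(candidates, key=noisy_channel_score, reverse=True)


# Made-up scores: the forward model alone prefers A, the combined score prefers B.
beam = [
    Candidate("hypothesis A", direct_logprob=-1.0, channel_logprob=-4.0, lm_logprob=-3.0),
    Candidate("hypothesis B", direct_logprob=-1.5, channel_logprob=-2.0, lm_logprob=-2.5),
]
best = rerank(beam)[0]
```

In this toy beam, the candidate that the forward model ranks first loses once the channel and language model scores are taken into account, which is the effect reranking is meant to capture.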
Second, we use the self-training algorithm, a method that we improved and analyzed in detail in another recent paper, in which we first train a forward MT system on only the small parallel dataset. We then use it to translate the monolingual data on the source side using noisy channel reranking. This process, again, produces a synthetic dataset composed of sentences from the source monolingual dataset with the corresponding machine translations. Finally, we retrain the forward system on the concatenation of the original parallel dataset with our synthetic dataset, yielding a better MT model.
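The self-training loop above can be sketched as follows. The "model" here is a deliberately trivial word lookup table so the example runs on its own; `train`, `translate`, and `self_train` are hypothetical helpers, not part of any real toolkit, and the sketch omits noisy channel reranking.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]   # (source sentence, target sentence)
Model = Dict[str, str]   # toy "model": a word-for-word lookup table


def train(pairs: List[Pair]) -> Model:
    """Toy trainer: memorize position-wise word correspondences."""
    table: Model = {}
    for src, tgt in pairs:
        for s_word, t_word in zip(src.split(), tgt.split()):
            table.setdefault(s_word, t_word)
    return table


def translate(model: Model, sentence: str) -> str:
    """Toy decoder: look up each word, copying unknown words unchanged."""
    return " ".join(model.get(word, word) for word in sentence.split())


def self_train(parallel: List[Pair], source_monolingual: List[str]):
    # 1. Train a forward model on the small parallel dataset only.
    forward = train(parallel)
    # 2. Translate source-side monolingual data to build synthetic pairs.
    synthetic = [(src, translate(forward, src)) for src in source_monolingual]
    # 3. Retrain on the original data concatenated with the synthetic data.
    return train(parallel + synthetic), synthetic


parallel_data = [("hello world", "bonjour monde")]
retrained, synthetic = self_train(parallel_data, ["hello friend"])
```

Here the source side of each synthetic pair is genuine and the target side is machine-generated, the mirror image of back-translation.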
We combine back-translation and self-training iteratively because they have complementary strengths. Self-training leverages source data that’s in-domain, but its reference translations are likely to be inaccurate because they are machine-generated. Back-translation, on the flip side, uses accurate references because they are human-written target sentences, but its data is typically not in-domain because it originates in the target language.
After using back-translation and self-training to produce synthetic datasets with back-translated and forward-translated sentences, respectively, we combine all these datasets and retrain both a (potentially bigger) forward model and a reverse model. As the translation quality of these models improves through retraining, we reuse them to retranslate the source and target monolingual datasets. This produces an iterative algorithm that alternates between translating monolingual data using the current MT systems and retraining the MT systems using the additional machine-generated parallel data. In practice, we use up to three iterations.
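The alternation between translating and retraining can be written as a short loop. The sketch below is an assumption-laden illustration: `iterate` and `make_toy_trainer` are hypothetical names, and the toy trainer builds a word lookup table rather than a neural model.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]                          # (source, target)
Translator = Callable[[str], str]
Trainer = Callable[[List[Pair]], Translator]


def make_toy_trainer(reverse: bool) -> Trainer:
    """Hypothetical trainer that returns a word-lookup 'translator'."""
    def train(pairs: List[Pair]) -> Translator:
        table = {}
        for src, tgt in pairs:
            inp, out = (tgt, src) if reverse else (src, tgt)
            for in_word, out_word in zip(inp.split(), out.split()):
                table.setdefault(in_word, out_word)
        return lambda s: " ".join(table.get(w, w) for w in s.split())
    return train


def iterate(parallel, src_mono, tgt_mono, train_fwd, train_rev, iterations=3):
    forward, reverse = train_fwd(parallel), train_rev(parallel)
    for _ in range(iterations):
        # Translate monolingual data with the current systems.
        back_translated = [(reverse(t), t) for t in tgt_mono]
        forward_translated = [(s, forward(s)) for s in src_mono]
        # Retrain both directions on real plus synthetic data.
        combined = parallel + back_translated + forward_translated
        forward, reverse = train_fwd(combined), train_rev(combined)
    return forward, reverse


forward, reverse = iterate(
    parallel=[("hello world", "bonjour monde")],
    src_mono=["hello friend"],
    tgt_mono=["monde bonjour"],
    train_fwd=make_toy_trainer(reverse=False),
    train_rev=make_toy_trainer(reverse=True),
)
```

With real models, each pass through the loop would use better translators to regenerate the synthetic data, which is where the iterative gains come from; the toy lookup tables here do not improve, they only show the control flow.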
Burmese is a particularly challenging language for MT not only because parallel data is scarce but also because Burmese is morphologically very rich, lacks word segmentation, and contains multiple encodings to represent the script. Nonetheless, our model learns to adapt to all of these distinctive features of the language. Because we are using a high volume of monolingual data, the model learns useful regularities that help properly represent this language.
Using our novel approach, we won first place at WAT, with a gain of more than 8 BLEU points over the second-best team. This is a great step forward; as a reference point, research papers presented at major conferences in this field often report gains lower than 1 BLEU point. Our model translations in both directions were also significantly preferred by human evaluators both in terms of fluency and adequacy.
A complementary approach to improving low-resource MT is our open source LASER toolkit, which enables us to filter noisy parallel datasets and extract high-quality training data from freely available datasets. We used LASER to score sentence pairs in order to gauge the quality of their translations. This was a key method in achieving the best results in the recent WMT corpus filtering competition for Nepali-English and Sinhala-English.
The toolkit uses a “universal” encoder, which is a single encoder that works across all languages, to grasp the meaning of a sentence and produce its vector representation. Using these multilingual representations, we can compute a similarity score for each sentence pair, which we then use to filter data and identify the highest-quality source and target sentence pairs for training translation models. The similarity score for a sentence pair, regardless of the language, is based on the normalized cosine similarity of their vector representations. This method is also effective on languages that LASER does not support but that share a similar script with a language that LASER does support. By applying this scoring function to noisy data, we’re able to find “needles in the haystack”: sentences that are useful for training an MT system from scratch.
To evaluate the trained MT models for both language pairs, the organizers used the Facebook Low Resource (FLoRes) dataset, the first high-quality, publicly available benchmark designed for evaluating low-resource MT systems, released earlier this year. For Sinhala-English, our system achieved a 22% higher score (+1.4 BLEU) than the second-best entry. For the same task in Nepali-English, we achieved a 20% higher score (+1.3 BLEU) than the second-best entry. Read our full paper on low-resource corpus filtering here.
We deployed LASER to clean noisy parallel data from freely available public datasets. As a result, we have seen translation quality increase by an average of +2.9 BLEU points across low-resource languages, including Khmer, Somali, Marathi, Xhosa, and Zulu. For these languages, this was a zero-shot scenario in which we were able to compute similarities for languages that weren’t used at training time.
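The scoring-and-filtering idea can be illustrated with plain cosine similarity. This sketch does not call the LASER toolkit: the embeddings are hand-written placeholder vectors, `filter_pairs` and the 0.8 threshold are hypothetical, and the exact normalization used in the competition system is not reproduced here.

```python
import math
from typing import List, Sequence, Tuple

# (source sentence, target sentence, source embedding, target embedding)
ScoredInput = Tuple[str, str, Sequence[float], Sequence[float]]


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def filter_pairs(candidates: List[ScoredInput], threshold: float) -> List[Tuple[str, str]]:
    """Keep sentence pairs whose embeddings are at least `threshold` similar."""
    return [
        (src, tgt)
        for src, tgt, src_vec, tgt_vec in candidates
        if cosine_similarity(src_vec, tgt_vec) >= threshold
    ]


# Placeholder embeddings; a real system would obtain these from a multilingual encoder.
noisy_corpus = [
    ("good pair source", "good pair target", [1.0, 0.0], [1.0, 0.0]),
    ("bad pair source", "unrelated target", [1.0, 0.0], [0.0, 1.0]),
]
kept = filter_pairs(noisy_corpus, threshold=0.8)
```

Because both sentences are embedded in the same multilingual space, a single threshold can be applied regardless of which language pair is being filtered.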
This method is also useful for building new, publicly available datasets to help the MT community train and evaluate models. For instance, we recently used LASER and the Faiss (Facebook AI Similarity Search) library to build and release WikiMatrix, the largest and most complete parallel dataset, including 135 million sentences in 1,620 language pairs mined from Wikipedia.
Bringing research models to production requires overcoming a new set of challenges, like speeding up such large research models while maintaining computational efficiency. We need to balance specific tradeoffs between flexibility to support exploratory research and efficiency to scale to billions of translations a day.
First, we use PyTorch for fast iteration in training and development in both research and deployment. More specifically, we train neural machine translation (NMT) models using PyTorch’s fairseq, which supports scalable and efficient training, including distributed multi-GPU training, large batch sizes through delayed updates, and FP16 training. Efficient training is key for low-resource languages, since these leverage large amounts of monolingual data via either back-translation or self-training. Fairseq allows the model to benefit from scaling up back-translation and self-training. Then we use TorchScript to export the trained PyTorch model to an efficient graph model, which is ready to be deployed in the production infrastructure.
Next, we need to make sure the model we deploy satisfies the latency and throughput constraints needed to reliably serve all the traffic on our platform. Because high-performing models, such as those that won these MT competitions, would be too computationally intensive, we use sequence-level knowledge distillation to transfer the knowledge gained from iterative back-translation and self-training to a smaller and faster production model. Specifically, the student model is trained on a combination of the original parallel data and a translated dataset that the teacher model generates using beam search. Finally, we down-project the decoder hidden states to a smaller dimension before the softmax layer, because that final layer is responsible for the bulk of the computation.
By implementing a range of techniques focused on improving low-resource translation, like LASER, back-translation, self-training, and multilingual modeling, we’ve improved translation quality for many low-resource languages in production. For instance, Sinhala-to-English and Nepali-to-English translations on Facebook have improved from “useful,” meaning just accurate enough to convey the meaning, to “good,” meaning they convey the full meaning but may contain typos or grammatical errors.
Low-resource MT is a subfield of NLP that’s still in its infancy. In the future, we’ll continue developing new methods that can learn better and more efficiently from smaller parallel datasets by leveraging monolingual data, noisy parallel data, and data in other languages. Progress in this area will help remove language barriers and bring communities closer together. Advancing low-resource MT will also improve personalization of experiences as well as detection of policy-violating content to help keep people safe.
Enabling translations for all the languages of the world is a major open challenge and, in order to accelerate advancements, it’s important that we work together with the broader research community. This is why we’ve released training platforms and code so that researchers can reproduce our models. We’ve also released essential public datasets to provide better evaluation for low-resource MT, like the FLoRes and WikiMatrix datasets. These tools are important for measuring true progress and moving research forward. And we’ve launched an AI Language Research Consortium, bringing together community partners to focus on NLP problems, including low-resource MT. Facebook AI is committed to open science and sharing reproducible research to lower the barrier to entry and fuel progress in the field.
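A quick back-of-the-envelope calculation shows why down-projecting before the softmax helps. The sketch below counts the weights in the output layer with and without a bottleneck; the dimensions (a 1,024-dimensional hidden state, a 32,000-word vocabulary, a 256-dimensional bottleneck) are assumed values chosen for illustration, not the sizes of the production model.

```python
from typing import Optional


def output_layer_params(
    hidden_dim: int, vocab_size: int, bottleneck_dim: Optional[int] = None
) -> int:
    """Count weight parameters in the projection feeding the softmax (biases ignored)."""
    if bottleneck_dim is None:
        # Direct projection: hidden state -> vocabulary logits.
        return hidden_dim * vocab_size
    # Down-projection (hidden -> bottleneck) followed by bottleneck -> vocabulary logits.
    return hidden_dim * bottleneck_dim + bottleneck_dim * vocab_size


# Assumed, illustrative sizes.
full = output_layer_params(hidden_dim=1024, vocab_size=32000)
reduced = output_layer_params(hidden_dim=1024, vocab_size=32000, bottleneck_dim=256)
savings_ratio = full / reduced
```

Under these assumed sizes the output layer shrinks from 32,768,000 to 8,454,144 weights, roughly a 3.9x reduction, and the multiply-accumulate work per decoded token shrinks in the same proportion.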
More generally, this research is part of a broader effort towards methods that rely less on supervision (in this case, the use of professional translators), as the hallmark of truly intelligent systems is being able to quickly learn and adapt from a limited set of examples. Facebook AI is pursuing several approaches toward this goal across several domains, including vision, speech and text understanding.
We’d like to thank our colleagues Myle Ott, Matt Le, Liezl Puzon, Don Husa, Nikhil Gupta, Chau Tran, and Holger Schwenk for their contributions to this blog post.