No Language Left Behind (NLLB) is a first-of-its-kind, AI breakthrough project that open-sources models capable of delivering evaluated, high-quality translations directly between 200 languages—including low-resource languages like Asturian, Luganda, Urdu and more. It aims to give people the opportunity to access and share web content in their native language, and communicate with anyone, anywhere, regardless of their language preferences.
We’re committed to bringing people together. That’s why we’re using modeling techniques and learnings from our NLLB research to improve translations of low-resource languages on Facebook and Instagram. By applying these techniques and learnings to our production translation systems, we’ll enable people to make more authentic, more meaningful connections in their preferred or native languages. In the future, we hope to extend our learnings from NLLB to more Meta apps.
As we build for the metaverse, integrating real-time AR/VR text translation in hundreds of languages is a priority. Our aim is to set a new standard of inclusion, where someday everyone can have access to virtual-world content, devices and experiences, with the ability to communicate with anyone, in any language in the metaverse, and, over time, bring people together on a global scale.
The technology behind the NLLB-200 model, now available through the Wikimedia Foundation’s Content Translation tool, is supporting Wikipedia editors as they translate information into their native and preferred languages. Wikipedia editors are using the technology to more efficiently translate and edit articles originating in other under-represented languages, such as Luganda and Icelandic. This helps to make more knowledge available in more languages for Wikipedia readers around the world. The open-source NLLB-200 model will also help researchers and interested Wikipedia editor communities build on our work.
Experience the power of AI translation with Stories Told Through Translation, our demo that uses the latest AI advancements from the No Language Left Behind project. This demo translates books from their languages of origin, such as Indonesian, Somali and Burmese, into more languages for readers, with hundreds available in the coming months. Through this initiative, NLLB-200 will be the first-ever AI model able to translate literature at this scale.
By Su Nyein Chan
A farmer lives in a village that only grows red roses. What will happen when he plants strange seeds from a box he finds in his basement?
By Prum Kunthearo
When a baby elephant runs into their house, Botom is jealous of how much attention he gets. Can Botom get rid of the elephant, or will she become friends with the lovable creature as well?
By Nabila Adani
A girl is inspired by a school assignment to think about what she wants to be when she grows up. What will her dreams inspire her to become?
By Mohammed Umar
Samad loved animals. His dream was to spend a whole day in a forest and sleep in a treehouse. Follow Samad as he embarks on this adventure, making wonderful friends and amazing discoveries. Going into a forest has never been so much fun.
By Wulan Mulya Pratiwi
The prince is lost in the forest. A tiger is tracking him. What will he do?
Training data is collected containing sentences in the input language and desired output language.
After creating aligned training data for thousands of training directions, this data is fed into our model training pipeline. These models are made up of two parts: the encoder, which converts the input sentence into an internal vector representation; and the decoder, which takes this internal vector representation and accurately generates the output sentence. By training on millions of example translations, models learn to generate more accurate translations.
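For readers who want to try this encoder-decoder pipeline directly, below is a minimal sketch using the openly released NLLB-200 checkpoint through the Hugging Face transformers library. The checkpoint name and the language codes (eng_Latn, fra_Latn) are assumptions based on the public release rather than details from this page.

```python
# Minimal sketch: translating one sentence with an open NLLB-200 checkpoint.
# Checkpoint name and language codes are assumptions from the public release.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

checkpoint = "facebook/nllb-200-distilled-600M"  # assumed public checkpoint name
tokenizer = AutoTokenizer.from_pretrained(checkpoint, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

text = "No language should be left behind."

# The encoder turns the input sentence into an internal vector representation;
# the decoder then generates the output sentence in the target language.
inputs = tokenizer(text, return_tensors="pt")
generated = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("fra_Latn"),  # target language
    max_length=64,
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])
```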
Finally, we evaluate our model against a human-translated set of sentence translations to confirm that we are satisfied with the translation quality. This includes detecting and filtering out profanity and other offensive content through the use of toxicity lists we build for all supported languages. The result is a well-trained model that can directly translate a language.
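As a rough illustration of the filtering step, the sketch below screens candidate translations against a per-language wordlist. The list contents and function names here are hypothetical placeholders, not the actual released toxicity lists.

```python
# Sketch of wordlist-based toxicity filtering for translation output.
# TOXICITY_LISTS is a hypothetical placeholder, not the released resources.
import re

TOXICITY_LISTS: dict[str, set[str]] = {
    "eng_Latn": {"badword", "slur"},  # placeholder entries only
}

def contains_toxicity(sentence: str, lang: str) -> bool:
    """Return True if any token of the sentence appears in the language's list."""
    tokens = re.findall(r"\w+", sentence.lower())
    return any(token in TOXICITY_LISTS.get(lang, set()) for token in tokens)

# Keep only candidate translations that pass the toxicity check.
candidates = ["a clean translation", "a translation containing badword"]
kept = [c for c in candidates if not contains_toxicity(c, "eng_Latn")]
print(kept)
```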
Machine translation (MT) is a supervised learning task, which means the model needs example translations to learn from. These often come from open-source data collections, but such data is scarce for many languages. Our solution is to automatically construct translation pairs by pairing sentences across different collections of monolingual documents.
The LASER models used for this dataset creation process primarily support mid- to high-resource languages, making it impossible to produce accurate translation pairs for low-resource languages.
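To make the pairing idea concrete, here is a small sketch of embedding-based bitext mining over two monolingual corpora. The embed function stands in for a multilingual sentence encoder such as LASER, and the simple cosine threshold is a simplification of the margin-based scoring typically used in practice.

```python
# Sketch of embedding-based bitext mining: pair sentences from two monolingual
# corpora when their embeddings are close in a shared multilingual space.
# embed() is a hypothetical stand-in for a multilingual sentence encoder.
import numpy as np

def embed(sentences: list[str]) -> np.ndarray:
    """Placeholder encoder: returns one L2-normalized vector per sentence."""
    rng = np.random.default_rng(0)
    vecs = rng.normal(size=(len(sentences), 1024))
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def mine_pairs(src_sents: list[str], tgt_sents: list[str], threshold: float = 0.8):
    """Pair each source sentence with its nearest target sentence above a threshold."""
    sims = embed(src_sents) @ embed(tgt_sents).T  # cosine similarities (unit vectors)
    pairs = []
    for i, row in enumerate(sims):
        j = int(row.argmax())  # closest candidate translation
        if row[j] >= threshold:
            pairs.append((src_sents[i], tgt_sents[j], float(row[j])))
    return pairs

print(mine_pairs(["Hello, world."], ["Bonjour le monde."]))
```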
Multilingual MT systems have been shown to improve on bilingual systems. This is due to their ability to enable "transfer" from language pairs with plenty of training data to other languages with fewer training resources.
Jointly training hundreds of language pairs has its disadvantages: the same model must represent an increasingly large number of languages with the same number of parameters. This becomes an issue when dataset sizes are imbalanced, as it can cause overfitting.
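A common way to counter this imbalance is temperature-based sampling over language pairs, which upweights low-resource pairs during training. The sketch below illustrates the idea with made-up dataset sizes; the exact sampling scheme used in training is not spelled out on this page.

```python
# Sketch of temperature-based sampling over imbalanced language pairs:
# p_i is proportional to (n_i / N) ** (1 / T). A higher temperature T flattens
# the distribution so low-resource pairs are sampled more often than their raw
# share of the data. Dataset sizes below are made up for illustration.
import numpy as np

pair_sizes = {"eng-fra": 10_000_000, "eng-ast": 50_000, "eng-lug": 20_000}

def sampling_probs(sizes: dict[str, int], temperature: float = 5.0) -> dict[str, float]:
    counts = np.array(list(sizes.values()), dtype=float)
    weights = (counts / counts.sum()) ** (1.0 / temperature)
    return dict(zip(sizes, weights / weights.sum()))

print(sampling_probs(pair_sizes))  # low-resource pairs get a boosted share
```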
To know if a translation produced by our model meets our quality standards, we must evaluate it.
Machine translation models are typically evaluated by comparing machine-translated sentences with human translations; however, for many languages reliable human translation data is not available, so accurate evaluations are not possible.
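When references do exist, the comparison is typically automated with metrics such as BLEU or chrF over a professionally translated benchmark like FLORES-200. The sketch below uses the sacrebleu library with placeholder sentences; the metric settings shown are assumptions rather than the exact evaluation configuration.

```python
# Sketch: scoring system output against human references with sacrebleu.
# Sentences are placeholders; in NLLB-style evaluation the references come
# from a human-translated benchmark such as FLORES-200.
import sacrebleu

hypotheses = ["The cat is sitting on the mat.", "He plays football every Sunday."]
references = [["The cat sat on the mat.", "He plays soccer every Sunday."]]  # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references, word_order=2)  # chrF++-style setting
print(f"BLEU: {bleu.score:.1f}  chrF++: {chrf.score:.1f}")
```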
Learn more about the science behind NLLB by reading our whitepaper and blog, and by downloading the model to help us take this project further.
Model milestones by number of languages released
The first successful exploration of massively multilingual sentence representations shared publicly with the NLP community. The encoder creates embeddings to automatically pair up sentences sharing the same meaning in 50 languages.
Facebook AI models outperformed all other models at WMT 2019, using large-scale sampled back-translation, noisy channel modeling and data cleaning techniques to help build a strong system.
A benchmarking dataset for MT between English and low-resource languages introducing a fair and rigorous evaluation process, starting with 2 languages.
The largest extraction of parallel sentences across multiple languages: Bitext extraction of 135 million Wikipedia sentences in 1,620 language pairs for building better translation models.
The first single multilingual machine translation model to directly translate between any pair of 100 languages without relying on English data. Trained on 2,200 language directions, 10x more than previous multilingual models.
The largest dataset of high-quality, web-based bitexts for building better translation models that work with more languages, especially low-resource languages: 4.5 billion parallel sentences in 576 language pairs.
Creates embeddings to automatically pair up sentences sharing the same meaning in 100 languages.
For the first time, a single multilingual model outperformed the best specially trained bilingual models across 10 out of 14 language pairs to win WMT 2021, providing the best translations for both low- and high-resource languages.
FLORES-101 is the first-of-its-kind, many-to-many evaluation data set covering 101 languages, enabling researchers to rapidly test and improve upon multilingual translation models like M2M-100.
The NLLB model translates 200 languages.
Expansion of the FLORES evaluation data set, now covering 200 languages.
Constructed and released training data for 200 languages.
Creates embeddings to automatically pair up sentences sharing the same meaning in 200 languages.
Acehnese (Latin script)
Arabic (Iraqi/Mesopotamian)
Arabic (Yemen)
Arabic (Tunisia)
Afrikaans
Arabic (Jordan)
Akan
Amharic
Arabic (Lebanon)
Arabic (Modern Standard Arabic)
Arabic (Saudi Arabia)
Arabic (Morocco)
Arabic (Egypt)
Assamese
Asturian
Awadhi
Aymara
Crimean Tatar
Welsh
Danish
German
French
Friulian
Fulfulde
Dinka (Rek)
Dyula
Dzongkha
Greek
English
Esperanto
Estonian
Basque
Ewe
Faroese
Iranian Persian
Icelandic
Italian
Javanese
Japanese
Kabyle
Kachin | Jinghpo
Kamba
Kannada
Kashmiri (Arabic script)
Kashmiri (Devanagari script)
Georgian
Kanuri (Arabic script)
Kanuri (Latin script)
Kazakh
Kabiye
Thai
Khmer
Kikuyu
South Azerbaijani
North Azerbaijani
Bashkir
Bambara
Balinese
Belarusian
Bemba
Bengali
Bhojpuri
Banjar (Latin script)
Tibetan
Bosnian
Buginese
Bulgarian
Catalan
Cebuano
Czech
Chokwe
Central Kurdish
Fijian
Finnish
Fon
Scottish Gaelic
Irish
Galician
Guarani
Gujarati
Haitian Creole
Hausa
Hebrew
Hindi
Chhattisgarhi
Croatian
Hungarian
Armenian
Igbo
Ilocano
Indonesian
Kinyarwanda
Kyrgyz
Kimbundu
Kongo
Korean
Kurdish (Kurmanji)
Lao
Latvian (Standard)
Ligurian
Limburgish
Lingala
Lithuanian
Lombard
Latgalian
Luxembourgish
Luba-Kasai
Ganda
Dholuo
Mizo