Meta Releases NLLB-200, New Open-Source Model Able To Translate 200 Languages
NLLB-200 is a model able to translate between 200 languages, including those often neglected. Though the development process posed many challenges, NLLB-200 has been released open-source along with many of the tools used to make it.
Meta AI today revealed a new model called NLLB-200 (No Language Left Behind), a 54-billion-parameter translation model.
The model purports to translate between 200 different languages at high quality, including commonly spoken languages and many that are often forgotten or neglected.
Not only is NLLB-200 open-sourced along with many of the tools and data created during the process, but Meta AI is encouraging researchers to work with and expand the capabilities of the model and the resources surrounding it. Up to $200,000 in grants will be offered to nonprofit organizations that demonstrate real-world applications of the model. NLLB-200 is downloadable from GitHub in a range of sizes from 600M to 54.5B parameters.
A partnership with the Wikimedia Foundation will bring NLLB-200 into the hands of those who help translate the millions of articles on Wikipedia every day: NLLB-200 will now be available in the Content Translation Tool used for those translations.
If you'd like to see what NLLB-200 is capable of, check out the Stories Told Through Translation project, where the model is used to translate a growing collection of stories from around the world.
Before I dive into the model itself, let's take a minute to watch the short video Meta put out discussing the problem they're trying to solve with NLLB-200, and their solution.
The challenges in creating NLLB-200
Due to the sheer number of languages NLLB-200 supports, the researchers had many challenges to contend with.
Generating translation data for 200 languages
With so many languages, especially the low-resource languages which make up the key appeal of this model, it is difficult to get all the data needed for training. Typically, language translation models are trained on huge datasets of human-translated language that are readily available and often open-source. Datasets for common language pairs like English and Spanish are easy to come by, but what about English and Fula?
One solution is to effectively scrape the internet for as much low-resource language content as possible and work with it as best you can despite its flaws. Many filtering steps were added to this catch-all data collection process to ensure high-quality data would be used.
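To give a concrete sense of what one of those filtering steps might look like, here's a minimal sketch of language-identification filtering in Python. It assumes fastText's off-the-shelf LID model (lid.176.bin) purely for illustration; Meta's actual pipeline relies on its own classifiers and additional cleaning heuristics.

```python
import fasttext

# Off-the-shelf fastText language-identification model, used here as an
# assumed stand-in for the project's own LID classifiers.
lid_model = fasttext.load_model("lid.176.bin")

def keep_sentence(sentence: str, expected_lang: str, threshold: float = 0.5) -> bool:
    """Keep a scraped sentence only if LID agrees with the expected language."""
    labels, probs = lid_model.predict(sentence.replace("\n", " "), k=1)
    predicted = labels[0].replace("__label__", "")
    return predicted == expected_lang and probs[0] >= threshold

scraped = ["El gato duerme sobre la alfombra.", "Buy cheap watches online!!!"]
spanish_only = [s for s in scraped if keep_sentence(s, expected_lang="es")]
```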
As part of creating NLLB-200, the researchers also improved their LASER toolkit to generate billions of parallel sentences from this data across all the different language pairs used for model training. LASER3, the newer version of the toolkit, will be open-sourced along with the generated sentence pairs.
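The core idea behind that mining step is to embed sentences from different languages into a shared space and pair up nearest neighbors. The sketch below illustrates it with LaBSE as a stand-in encoder and a plain cosine-similarity threshold; LASER3 itself ships its own embedding and mining tooling, and the real pipeline uses margin-based scoring rather than raw similarity.

```python
from sentence_transformers import SentenceTransformer

# LaBSE as a stand-in multilingual encoder; the real pipeline uses LASER3,
# which provides its own embedding and mining tools.
encoder = SentenceTransformer("sentence-transformers/LaBSE")

def mine_pairs(src_sents, tgt_sents, threshold=0.7):
    """Pair each source sentence with its most similar target sentence,
    keeping only pairs whose cosine similarity clears a threshold."""
    src = encoder.encode(src_sents, normalize_embeddings=True)
    tgt = encoder.encode(tgt_sents, normalize_embeddings=True)
    scores = src @ tgt.T                  # cosine similarities (normalized vectors)
    best = scores.argmax(axis=1)          # nearest target per source sentence
    return [(src_sents[i], tgt_sents[j], float(scores[i, j]))
            for i, j in enumerate(best) if scores[i, j] >= threshold]

pairs = mine_pairs(["The cat sleeps on the mat."],
                   ["Le chat dort sur le tapis.", "Il pleut beaucoup ici."])
```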
Managing 200 languages in a single model
Fitting 200 languages into a single model seems like a crazy idea, but that's exactly what the researchers did. Because it's a single model, all translation tasks share the same collection of weights. This is not only an issue for tuning the model in the first place; because so many of the languages are low-resource, the model can easily slip into overfitting on the high-resource languages if training isn't done carefully.
To mitigate overfitting, a mixture-of-experts approach was taken to make sure that low-resource languages get a fair capacity allocation despite their much smaller amount of training data. Another step was to train the high-resource languages into the model first and bring the low-resource languages in afterwards. Many more steps were taken to ensure NLLB-200 could produce high-quality translations across all involved languages.
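To make the mixture-of-experts idea a little more concrete, here's a minimal, self-contained sketch of top-2 expert routing in PyTorch. It shows the general routing mechanism only; the layer sizes, expert count, and the absence of load-balancing losses are placeholder assumptions, not NLLB-200's actual configuration.

```python
import torch
import torch.nn as nn

class MoELayer(nn.Module):
    """Toy mixture-of-experts feed-forward layer with top-2 gating."""
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.gate = nn.Linear(d_model, n_experts)
        self.top_k = top_k

    def forward(self, x):                        # x: (tokens, d_model)
        gate_logits = self.gate(x)               # routing scores per expert
        weights, indices = gate_logits.topk(self.top_k, dim=-1)
        weights = weights.softmax(dim=-1)        # normalize over the chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, k] == e        # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(16, 512)
routed = MoELayer()(tokens)
```

Because each token only activates a couple of experts, total parameter count can grow without a proportional increase in compute per token, which is what lets low-resource languages claim some dedicated capacity instead of competing for the same shared weights.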
Evaluating translations across 200 languages
Evaluation is the final step of any model-building process, and here we're stuck vetting the translation quality of 200 languages. Even if translation output appears fine at first glance, translation models are often prone to hallucination, or the confident generation of incorrect language.
To evaluate such a complex model, the researchers developed FLORES-200, an updated version of FLORES, an evaluation benchmarking solution for English to low-resource language translation. A toxicity list was created to support the evaluation process, which will also be released for further research use.
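Once reference translations exist for a language pair, scoring follows the usual machine-translation recipe. Below is a small sketch using sacrebleu with toy strings; the actual NLLB evaluation reports spBLEU and chrF++ variants computed over the FLORES-200 sets, alongside human review and the toxicity checks mentioned above.

```python
import sacrebleu

# Model outputs and human reference translations for some language pair
# (toy strings for illustration; FLORES-200 provides the real evaluation sets).
hypotheses = ["The cat sleeps on the mat.", "It rains a lot here."]
references = [["The cat is sleeping on the mat.", "It rains a great deal here."]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references)
print(f"BLEU: {bleu.score:.1f}  chrF: {chrf.score:.1f}")
```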
How does NLLB-200 work?
NLLB-200 works much like any other translation model, except that it fits a whopping 200 languages into a single model. Despite the hurdles in developing such a model, NLLB-200 shows improved BLEU scores compared to other translation models, averaging a 44% improvement over previous state-of-the-art models on the FLORES-101 benchmark.
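If you want to try the released model yourself, the distilled 600M-parameter checkpoint on Hugging Face is the easiest entry point. The snippet below is a minimal sketch assuming that checkpoint and the standard transformers seq2seq API; the target language is selected by forcing the decoder to start with the corresponding language tag.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Distilled 600M-parameter NLLB-200 checkpoint (assumed here for illustration;
# larger checkpoints follow the same pattern).
name = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(name, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(name)

inputs = tokenizer("Translation should leave no language behind.", return_tensors="pt")
generated = model.generate(
    **inputs,
    # Force the decoder to begin with the target-language tag (French here).
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("fra_Latn"),
    max_length=64,
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])
```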
Find out more
Tags: ML News