
Efficiently Upgrading Multilingual Machine Translation Models to Support More Languages

An empirical study of which methods best help a multilingual translation model learn more languages.

This paper tackles the difficulty of adding more languages to an existing multilingual machine translation model. The authors claim their approach saves computation, avoids catastrophic forgetting, and matches the performance of training from scratch.
The method outlined is less a new algorithm than an empirical study of which combination of existing approaches works best.
Their results point to three ingredients that matter most when scaling to more languages:
  1. careful weight initialization
  2. learning rate scaling
  3. data up-sampling

Weight Initialization

In the paper, the authors tested scaling the model deeper and wider (deeper as in more layers, wider as in larger hidden dimensions). They note that it is unclear how beneficial it is to simply copy weights from the old model to the new one while still allowing "maximal knowledge about the old directions to be attained". In other words, how do we carry the trained weights over and retrain on more languages while keeping the model from forgetting what it has already learned?
They found the following methods to work (compared to a naive baseline of, I'm assuming, just copying the old model's weights into the new one); a minimal sketch follows the list:
  • initialize the embeddings of new tokens with the <unk> token's embedding
  • in the case of a wider model, build the larger weight matrix by concatenating the old weight matrix with newly initialized weights
  • in the case of a deeper model, initialize new layers as the average of the weights of the existing layers
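Here is a minimal PyTorch-style sketch of these three initializations. The function names, shapes, and the use of Xavier initialization for the freshly added weights are my own assumptions, not details taken from the paper:

```python
import torch

def init_upgraded_embeddings(old_emb: torch.Tensor, unk_id: int, new_vocab_size: int) -> torch.Tensor:
    """Copy old embedding rows; rows for newly added tokens start from the <unk> embedding."""
    old_vocab_size, dim = old_emb.shape
    new_emb = old_emb[unk_id].repeat(new_vocab_size, 1)  # every row starts as the <unk> embedding
    new_emb[:old_vocab_size] = old_emb                   # overwrite with the old rows where they exist
    return new_emb

def widen_weight(old_weight: torch.Tensor, extra_out: int, extra_in: int) -> torch.Tensor:
    """Build a wider matrix: fresh init everywhere, then paste the old matrix into the top-left block."""
    out_dim, in_dim = old_weight.shape
    new_weight = torch.empty(out_dim + extra_out, in_dim + extra_in)
    torch.nn.init.xavier_uniform_(new_weight)  # init scheme for the new part is an assumption
    new_weight[:out_dim, :in_dim] = old_weight
    return new_weight

def average_layer_init(old_layer_states: list[dict]) -> dict:
    """Initialize an added layer as the parameter-wise average of existing layers' state dicts.
    Assumes all layers share identical parameter names and shapes (true for stacked Transformer layers)."""
    keys = old_layer_states[0].keys()
    return {k: torch.stack([sd[k] for sd in old_layer_states]).mean(dim=0) for k in keys}
```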

Data Up-Sampling

They want to preserve the model's learned information while also adding new information; this is the problem data up-sampling tackles. They found the following to work (a sampling sketch follows the list):
  • mix the old and new data
  • upsample the new data so the model receives a stronger signal from it
  • upsample old data whose language family matches the families of the newly added languages
    • This makes intuitive sense! Similar languages from the old data would naturally help the model learn a new language of a similar flavor.
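A rough sketch of what such a sampling scheme could look like. The boost factors, function names, and dictionary-based corpus/family bookkeeping are illustrative assumptions, not the paper's actual sampling ratios:

```python
import random

def make_sampling_probs(old_data, new_data, family, new_boost=4.0, related_boost=2.0):
    """Corpus-level sampling probabilities that upsample (a) the newly added languages and
    (b) old languages whose family matches a new language. Boost factors are illustrative."""
    new_families = {family[lang] for lang in new_data}

    def boost(lang):
        if lang in new_data:
            return new_boost
        if family.get(lang) in new_families:
            return related_boost
        return 1.0

    all_data = {**old_data, **new_data}
    weights = {lang: boost(lang) * len(pairs) for lang, pairs in all_data.items()}
    total = sum(weights.values())
    return {lang: w / total for lang, w in weights.items()}, all_data

def sample_batch(probs, all_data, batch_size=32):
    """Pick a language per example from the mixed distribution, then a sentence pair from that corpus."""
    langs = random.choices(list(probs), weights=list(probs.values()), k=batch_size)
    return [random.choice(all_data[lang]) for lang in langs]
```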

Learning Rate Scaling

In addition to careful weight initialization, they scale the learning rate depending on whether a given part of the model is old or new. This presupposes that the upgraded model has extra capacity (new layers or wider matrices) on top of the copied weights.
They found that giving the copied old weights a low learning rate and the new layers or weights a higher learning rate worked best. In effect, the model barely edits its already-learned knowledge, while its untrained parameters learn quickly.
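In PyTorch this kind of scheme is naturally expressed with optimizer parameter groups. A minimal sketch, assuming old and new parameters can be told apart by name; the choice of Adam, the base learning rate, and the 0.1 scale factor are assumptions, not values from the paper:

```python
import torch

def build_optimizer(model, old_param_names, base_lr=5e-4, old_lr_scale=0.1):
    """Two parameter groups: copied-over (old) weights get a scaled-down learning rate,
    newly added weights get the full rate."""
    old_params, new_params = [], []
    for name, param in model.named_parameters():
        (old_params if name in old_param_names else new_params).append(param)
    return torch.optim.Adam([
        {"params": old_params, "lr": base_lr * old_lr_scale},  # barely nudge learned knowledge
        {"params": new_params, "lr": base_lr},                 # let fresh capacity learn quickly
    ])
```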

Results


This table has a lot of words! In summary:
  • Rows are the two metrics they use
    • "Orig" is performance on the old (original) languages
    • "Added" is performance on the newly added languages
  • First column: M20
    • the baseline model trained on 20 languages, with no additional languages
  • Second column (and 3rd and 4th): M25 & Mt25
    • a standard Transformer trained on all 25 languages from scratch (M25)
    • a standard Transformer upgraded to support the added languages (Mt25)
    • @##k, where ## is 30 or 100, is the number of updates/iterations
    • the 3rd and 4th columns are the same setup but with the suggested architectural changes (wide and deep, respectively)
Within each column, compare M25 and Mt25 on each metric: with their methods, the upgraded model matches (if not exceeds) the performance of a model trained from scratch using only 30% or 50% of the computation.
That is the gist of this study: three established methods can help scale a multilingual translation model to more languages while saving a whole lot of computation!

