My learnings from training TajBERTo
This blog post discusses the author's experience training a RoBERTa model for the Tajik language. The author describes the process of data collection, preprocessing, and training, as well as the challenges of training a model on a low-resource language and the importance of data quality.
Created on June 3 | Last edited on June 15
Intro
A long time ago, in a distant galaxy named Google Colab, I trained a “small” model (84M parameters: 6 layers, hidden size 768, 12 attention heads, i.e. the same number of layers and heads as DistilBERT) on my native Tajik language.
I had been interested in trying out W&B for a while, since I had heard good things about it from multiple sources. What really sold me on it were the reporting features and the mobile tracking feature. Being able to easily share detailed results with others is great, and being able to track my progress (and potential problems) while away from my computer is invaluable. So I took some time to clean up the original data set, retrain with W&B, and reflect on my learnings.
The model was trained on the task of masked language modeling, i.e. predicting arbitrary tokens that we randomly mask in the dataset.
You can fine-tune the model on a downstream task (if you have the data)
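To make the masked-language-modeling objective concrete, here is roughly what querying the trained model for a masked token looks like. This is a hedged sketch: the model path is a placeholder for wherever the checkpoint and tokenizer live, and the example sentence simply means "Tajikistan is a <mask> country."

```python
# A sketch of masked-token prediction with the trained model; "path/to/TajBERTo"
# is a placeholder path, not the actual published model id.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="path/to/TajBERTo")

# "Tajikistan is a <mask> country." - the pipeline returns ranked candidate fill-ins.
print(fill_mask("Тоҷикистон кишвари <mask> аст."))
```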
Here are some fun facts about the Tajik language (ISO 639-1 code: tg):
- Tajik is a member of the Iranian branch of the Indo-European language family.
- Tajik is written in the Cyrillic script.
- The normal word order in Tajik is Subject-Object-Verb. Modifiers follow the nouns they modify.
Tale of a dataset
It was easy to find a corpus of text in Tajik. First I went ahead and used the Tajik portion of the OSCAR corpus from INRIA. OSCAR is a huge multilingual corpus obtained by language classification and filtering of Common Crawl dumps of the Web.
However, after diving deeper, I noticed it was heavily contaminated by other languages; on top of that, less than half of the sentences were of acceptable quality.
Automatic language identification (LangID) is a core technology needed to collect such datasets in a multilingual context.
The OSCAR team used a fastText language classifier, which performs really well and is quite fast, but it has a lot of problems, especially with similar languages such as Russian, Ukrainian, and Belarusian.
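For a sense of what LangID-based filtering looks like in practice, here is a hedged sketch using the publicly available lid.176 fastText model; the confidence threshold is my own illustrative choice, not OSCAR's actual pipeline.

```python
# Illustrative LangID filtering with fastText; lid.176.bin is the public
# pre-trained language-identification model, and the 0.8 threshold is an assumption.
import fasttext

lid = fasttext.load_model("lid.176.bin")  # downloaded from the fastText website

def keep_tajik(sentence: str, threshold: float = 0.8) -> bool:
    """Keep a sentence only if the classifier is confident it is Tajik (tg)."""
    labels, probs = lid.predict(sentence.replace("\n", " "))
    return labels[0] == "__label__tg" and probs[0] >= threshold

print(keep_tajik("Тоҷикистон кишвари зебо аст."))  # "Tajikistan is a beautiful country."
```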
On top of that, as Caswell et al. have pointed out here, human-judged LangID accuracy for web-crawl text corpora created with these models is only around 5% for many lower-resource languages, suggesting a need for more robust evaluation. They find that many language detection methods achieve high accuracy on held-out test sets but perform poorly in practice, and I can certainly relate to their findings.
Apart from that, the original fastText classifier was trained on Wikipedia and Tatoeba, which are okay, but ideally one would train these types of classifiers on far more diverse corpora.
Yet another problem is that n-gram models are fast but they have limitations. In the long term, the OSCAR team wants to have some kind of hierarchical language classification with a more specialized (but probably slower) classifier at the deeper levels.
The problem with this approach is that they will really need the help of linguists and native speakers to structure it properly, so please join and help them if you are a native speaker and care about your language.
Removing invalid data is a very useful trick
I thought that if you spend enough time on improvements, it will be worth the effort. Hence I decided to help the OSCAR team and myself by curating a better dataset. To this end, I downloaded multiple versions of the Leipzig Corpora Collection, which comprises texts from diverse sources like news, literature, and Wikipedia. I then did some rigorous preprocessing with hard-coded heuristics and regexes, performing the steps below iteratively (a rough sketch of these heuristics follows the list):
- deduplicating
- removing curse words
- removing politically biased content
- removing sentences containing English characters
- removing words that don't exist in Tajik
- etc.
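Here is that rough sketch. The blocklist and the Tajik word list are empty placeholders to be populated separately; the exact rules I actually used were hard-coded, iterated on, and messier than this.

```python
# A hedged sketch of the cleaning heuristics; patterns and lists are illustrative.
import re

LATIN_RE = re.compile(r"[A-Za-z]")        # any English/Latin character
BLOCKLIST: set[str] = set()               # curse words, politically biased terms, etc.
TAJIK_VOCAB: set[str] = set()             # known-good Tajik word list, loaded elsewhere

def keep_sentence(sentence: str) -> bool:
    """Apply the per-sentence filters described in the list above."""
    words = [w.strip(".,!?").lower() for w in sentence.split()]
    if LATIN_RE.search(sentence):                      # drop lines with English characters
        return False
    if any(w in BLOCKLIST for w in words):             # drop curse words / biased content
        return False
    if TAJIK_VOCAB and not all(w in TAJIK_VOCAB for w in words):
        return False                                   # drop words that don't exist in Tajik
    return True

def clean(corpus: list[str]) -> list[str]:
    """Deduplicate and filter the corpus."""
    seen, kept = set(), []
    for line in corpus:
        line = line.strip()
        if line and line not in seen and keep_sentence(line):
            seen.add(line)
            kept.append(line)
    return kept
```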
W&B enabled me to monitor the progress of my models in real time while I incrementally added different versions of curated data.
And indeed, to my surprise, the run on the final version of the custom dataset outperformed the other runs in terms of loss, but more on that later. Lastly, if you also spend extra time not only on removing data but also on adding more examples, the dataset quality improves further, as confirmed by Kirstain et al. in "A Few More Examples May Be Worth Billions of Parameters" here.
Training a tokenizer
And it is easier than ever to train a new language model from scratch, using the amazing transformers and tokenizers libraries.
I chose to train a byte-level byte-pair encoding (BPE) tokenizer, with the same special tokens as RoBERTa and a vocabulary size of 52,000. Training a byte-level BPE rather than a WordPiece tokenizer like BERT's is the more viable option because it builds its vocabulary from an alphabet of single bytes, so every word can be decomposed into tokens and we never end up with <unk> (out-of-vocabulary) tokens. Moreover, compared to a generic tokenizer trained on English, more native words are represented by a single, unsplit token.
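As a sketch, training such a tokenizer with the Hugging Face tokenizers library looks roughly like the following; the data path and output directory are placeholders.

```python
# Training a byte-level BPE tokenizer with RoBERTa-style special tokens;
# "data/tajik_clean" and "TajBERTo" are placeholder paths.
from pathlib import Path
from tokenizers import ByteLevelBPETokenizer

paths = [str(p) for p in Path("data/tajik_clean").glob("*.txt")]

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=paths,
    vocab_size=52_000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],  # same as RoBERTa
)

Path("TajBERTo").mkdir(exist_ok=True)
tokenizer.save_model("TajBERTo")  # writes vocab.json and merges.txt
```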
Training a language model from scratch
Training was done via the transformers library, but I won't go into much detail as there are plenty of tutorials available online.
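Still, for completeness, here is a minimal sketch of the setup, assuming the tokenizer saved above and a line-by-line text file; the paths, batch size, and save interval are illustrative rather than the exact values I used.

```python
# A sketch of masked-language-model pretraining with the transformers Trainer;
# the model size matches the one described in the intro (6 layers, 768 hidden, 12 heads).
from transformers import (
    DataCollatorForLanguageModeling, LineByLineTextDataset,
    RobertaConfig, RobertaForMaskedLM, RobertaTokenizerFast,
    Trainer, TrainingArguments,
)

tokenizer = RobertaTokenizerFast.from_pretrained("TajBERTo", model_max_length=512)

config = RobertaConfig(
    vocab_size=52_000,
    num_hidden_layers=6,
    hidden_size=768,
    num_attention_heads=12,
    max_position_embeddings=514,
    type_vocab_size=1,
)
model = RobertaForMaskedLM(config=config)

dataset = LineByLineTextDataset(
    tokenizer=tokenizer, file_path="data/tajik_clean/train.txt", block_size=128
)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

training_args = TrainingArguments(
    output_dir="TajBERTo",
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=64,
    save_steps=10_000,
    save_total_limit=2,
    report_to="wandb",  # stream losses, learning rate, and gradients to W&B
)

Trainer(
    model=model, args=training_args, data_collator=collator, train_dataset=dataset
).train()
```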
For evaluation purposes, we can start by looking at the training losses going down.
As you can see below, especially at the earlier steps, the loss decreased as I kept preprocessing the dataset iteratively as described above: the "better" dataset has less noise in its vocabulary and yields a better tokenizer.
There was no hyperparameter search or anything like that; I just removed noisy training data 🤯
[W&B panel: training loss for the run set of 9 runs]
For the final run (named "final version custom dataset"), I remembered the old joke that
To achieve the best possible result, you should grid-search random seeds.
So I initialized a different seed and let the model run; to my surprise, it did pretty well at the beginning.
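In code, that final run amounted to nothing more than re-seeding everything before rebuilding the Trainer; the seed value here is arbitrary.

```python
# Re-seed python, numpy, and torch before the final run; 123 is an arbitrary example.
from transformers import set_seed

set_seed(123)
```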
Conclusion
Thanks for reading! In this post, we covered the challenges of training a model on a low-resource language and the importance of data quality. If you are interested, you can find the model and dataset on Hugging Face.
I enjoyed W&B's ability to compare different training runs of my models in terms of training loss and learning rates. Other very helpful features were setting alerts, checking that gradients were not vanishing, and tracking parameters on my phone to ensure that training went as planned.
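As an example of the alerting part, a loss-spike alert can be raised straight from the training script and lands as a notification on your phone; the project name, metric, and threshold below are made up for illustration.

```python
# A sketch of a W&B alert for a loss spike; project name and threshold are assumptions.
import wandb

wandb.init(project="tajberto")

def check_loss(step: int, loss: float, threshold: float = 8.0) -> None:
    """Ping my phone if the training loss spikes above the threshold."""
    if loss > threshold:
        wandb.alert(
            title="Loss spike",
            text=f"Training loss hit {loss:.2f} at step {step}",
            level=wandb.AlertLevel.WARN,
        )
```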