Training Devanagari Language Models on TPU using Hugging Face and PyTorch

This report describes a complete bottom-up approach, using Hugging Face, to training models for Devanagari languages like Marathi, Hindi, and Sanskrit.
Darshan Deshpande


Devanagari is a script used by several of the most widely spoken languages in India, including (but not limited to) Marathi, Hindi, Sanskrit, Nepali, and Konkani. This report describes how to train a DistilBERT model for Marathi from scratch using Hugging Face and touches upon the necessary tokenization and hyperparameter-tuning specifics. The information and techniques presented in this report can be extended to all Devanagari languages.
The code for this report can be found in this Colab Notebook.
Our fully scaled, current state-of-the-art (SoTA) version of DistilBERT for Marathi is available on Hugging Face.
You can try it here.


Tokenization is the process of converting words and sub-word segments into machine-understandable tokens. Let us look at some Marathi examples for each of the tokenizers in the Hugging Face Tokenizers library. DistilBertTokenizerFast (from the Transformers library), which we will use later on, is backed by BertWordPieceTokenizer.
1. BertWordPieceTokenizer
BertWordPieceTokenizer uses the WordPiece splitting algorithm, which breaks words into sub-words to keep the vocabulary compact while still covering most, if not all, word forms that appear outside the training text. Let us visualize how the algorithm works on Marathi.
tokenizer = tokenizers.BertWordPieceTokenizer(strip_accents=False)
tokenizer.train('Train.txt', vocab_size=30000)
tokenizer.encode('हें मुख्य यांचे बोलण्यांतील रहस्य').tokens
------------------------------------------------------
OUTPUT
['हें', 'मुख्य', 'यांचे', 'बोल', '##ण्यांत', '##ील', 'रहस्य']
Notice how बोलण्यांतील (Bōlaṇyāntīla) is split into बोल (Bōla), ##ण्यांत (ṇyānt), and ##ील (īla). Here ## is a continuation prefix indicating that the token is part of the preceding word. The strip_accents=False parameter is essential for Devanagari languages: without it, the tokenizer strips the diacritics (vowel signs and viramas), corrupting the text.
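To see why strip_accents=False matters, consider what accent stripping actually does: the text is Unicode-normalized and all combining marks are removed. In Devanagari, the matras (vowel signs) and the virama are combining marks, so stripping them mangles the word. A minimal stdlib-only sketch of this behaviour (it mirrors what the normalizer does conceptually, not the tokenizer's actual code):

```python
import unicodedata

def strip_accents(text: str) -> str:
    # Decompose to NFD, then drop all combining marks (category 'Mn').
    # In Devanagari, matras (vowel signs) and the virama are 'Mn' marks.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

# The vowel sign ु (U+0941) and virama ् (U+094D) are silently removed:
print(strip_accents("मुख्य"))  # -> मखय (no longer a valid Marathi word)
```

With accent stripping enabled, मुख्य collapses to मखय, which is why the flag must stay disabled for Devanagari corpora.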
2. ByteLevelBPETokenizer
ByteLevelBPETokenizer uses byte-level Byte-Pair Encoding (BPE), meaning that it is a universal tokenizer that can tokenize any language without ever needing the unknown token. Neural Machine Translation with Byte-Level Subwords (Wang et al., 2019) reports that a byte-level BPE vocabulary can be almost 1/8th the size of a regular BPE vocabulary. It achieves this by representing the sentence as a sequence of UTF-8 bytes (248 of the 256 possible byte values), which are then merged into variable-length n-grams. Let us visualize a sample:
tokenizer = tokenizers.ByteLevelBPETokenizer()
tokenizer.train('Train.txt', vocab_size=30000)
tokenizer.encode('हें मुख्य यांचे बोलण्यांतील रहस्य').tokens
------------------------------------------------------
OUTPUT
['ह', 'à¥ĩà¤Ĥ', 'Ġम', 'à¥ģ', 'à¤ĸ', 'à¥į', 'य', 'Ġय', 'ाà¤Ĥ', 'à¤ļ', 'à¥ĩ', 'Ġब', 'à¥ĭ', 'लण', 'à¥į', 'य', 'ाà¤Ĥ', 'त', 'à¥Ģ', 'ल', 'Ġरहस', 'à¥į', 'य']
The tokens are unreadable to humans because they are raw bytes, but they store the split patterns needed to decode the text back losslessly.
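The unreadable tokens above are an artifact of byte-level processing: every code point in the Devanagari block occupies three bytes in UTF-8, so a single visible character gets split across multiple byte tokens. A quick stdlib check:

```python
word = "रहस्य"            # 5 Unicode code points
raw = word.encode("utf-8")

# Every code point in the Devanagari block (U+0900-U+097F) needs 3 bytes
assert all(len(ch.encode("utf-8")) == 3 for ch in word)
print(len(word), len(raw))  # -> 5 15
```

A byte-level tokenizer therefore sees 15 symbols where a character-level one sees 5, which is exactly why its merges produce tokens that do not align with visible characters.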
3. CharBPETokenizer
CharBPETokenizer implements the BPE algorithm as introduced in Neural Machine Translation of Rare Words with Subword Units (Sennrich et al., 2015). Its default settings correspond to the OpenAI GPT BPE tokenizer and differ from the original Subword-NMT implementation by two extra parameters: bert_normalizer and split_on_whitespace_only. Both can be disabled to achieve tokenization close to that of the paper mentioned above.
tokenizer = tokenizers.CharBPETokenizer()
tokenizer.train('Train.txt', vocab_size=30000)
tokenizer.encode('हें मुख्य यांचे बोलण्यांतील रहस्य').tokens
------------------------------------------------------
OUTPUT
['हें', 'मुख्य', 'यांचे', 'बोलण्या', 'ंतील', 'रहस्य']
Like BertWordPieceTokenizer, CharBPETokenizer also splits words into multiple tokens in its default configuration. The </w> token represents the word suffix and signifies the end of a word split.


Till now, we have seen the differences between some of the commonly used tokenizers in the Tokenizers library. However, since we will be training a DistilBert model from the Transformers library, we should also use a Transformers tokenizer for compatibility.
As it turns out, DistilBertTokenizerFast (a Transformers tokenizer) is backed by BertWordPieceTokenizer (a Tokenizers tokenizer), meaning that we can still "use" BertWordPieceTokenizer.
DistilBertTokenizerFast/BertWordPieceTokenizer is implemented in Rust, which accelerates the tokenization process considerably.

Preprocessing and On-the-fly Tokenization

We will be using a subset of the Marathi OSCAR corpus for our training experiments. The subset contains 500,000 sentences of Marathi, one per line; it is small but will suffice for our experiments. We will use a Colab TPU v2-8 for our trials to make training faster and to explain some external optimizations that make the code more efficient.
The stock Transformers multiprocessing setup loads the model eight times, once per TPU core, but this is extremely wasteful and will cause our Colab environment to run out of memory when training on a slightly bigger corpus. We will therefore modify the code to load the model only once and reuse it throughout training, by instantiating the model outside the map function that xmp.spawn() calls. The model must then be moved to the device (TPU) inside the map function.
To do this, we first define our model configuration for a DistilBERT Masked Language Modeling (MLM) model. DistilBertConfig can be tweaked, but we will leave the defaults as they are for now. Note that the model is wrapped in xmp.MpModelWrapper, which loads the model only once, in the global scope; it is then moved onto each device inside the function launched by xmp.spawn(). The wrapper also serializes the movement of the model weights onto each device, which lowers the load on system memory during the process.
from transformers import DistilBertConfig, DistilBertForMaskedLM

config = DistilBertConfig(vocab_size=30000)
model = xmp.MpModelWrapper(DistilBertForMaskedLM(config))

def map_fn(index):
    device = xm.xla_device()
    ...
    ...

if __name__ == "__main__":
    xmp.spawn(map_fn, args=(), nprocs=8, start_method='fork')
Now we create a helper function to load and tokenize our dataset.
# Setting do_lower_case=False to avoid unnecessary issues with splitting
tokenizer = DistilBertTokenizerFast.from_pretrained("/content/Tokenizer", do_lower_case=False)

def get_tokenized_dataset():
    data_files = {"train": "/content/Train.txt"}
    tokenized_datasets = datasets.load_dataset('text', data_files=data_files)

    def tokenize_function(examples):
        # Remove empty lines
        examples["text"] = [line for line in examples["text"]
                            if len(line) > 0 and not line.isspace()]
        return tokenizer(
            examples["text"],
            padding="max_length",
            truncation=True,
            max_length=128,
            return_special_tokens_mask=True,
        )

    return tokenized_datasets.with_transform(tokenize_function)
We define data_files, which contains the file names for the train split. Validation and test set key-value pairs can also be added if needed. This is then loaded as a Hugging Face dataset.
A tokenize_function is created to tokenize the dataset line by line. The with_transform function is a recent addition to the Datasets library that applies the mapping on the fly, instead of writing the tokenized dataset to physical storage via PyArrow. This is helpful in our case, where both RAM and storage are limited. Set padding="max_length" when training on TPU: XLA compiles a new graph for every input shape, so dynamically padded batches would trigger constant recompilation. Pad dynamically only if you are training on CPU/GPU/multi-GPU configurations.
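The idea behind with_transform can be illustrated with a minimal lazy wrapper (a conceptual sketch, not the Datasets implementation): the transform runs only when an item is fetched, so no tokenized copy of the corpus is ever materialized in RAM or on disk.

```python
class LazyTransformDataset:
    """Apply a transform only when an example is accessed."""

    def __init__(self, lines, transform):
        self.lines = lines          # raw text stays as-is
        self.transform = transform  # e.g. a tokenizer call

    def __len__(self):
        return len(self.lines)

    def __getitem__(self, idx):
        # Tokenization happens here, on the fly, per item/batch
        return self.transform(self.lines[idx])

# Toy "tokenizer": split on whitespace
ds = LazyTransformDataset(["हें मुख्य", "रहस्य"], lambda s: s.split())
print(ds[0])  # -> ['हें', 'मुख्य']
```

An eager map would call the transform on all 500,000 lines up front and store the result; the lazy version trades a little per-step compute for a much smaller memory and storage footprint.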
Before feeding our tokenized dataset to the Trainer, we need a DataCollator; more specifically, a DataCollatorForLanguageModeling, which randomly masks tokens for the MLM training.
def get_data_collator():
    return DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
The mlm_probability parameter defines the fraction of tokens to be masked for the language-modelling loss.
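What the collator does can be sketched in plain Python. The standard BERT recipe, which DataCollatorForLanguageModeling follows, selects tokens with probability mlm_probability; of those, 80% become [MASK], 10% become a random token, and 10% stay unchanged, while unselected positions get a label of -100 so the loss ignores them. A simplified, non-tensor sketch (the mask ID and vocabulary size below are placeholders):

```python
import random

MASK_ID = 103       # placeholder [MASK] id
VOCAB_SIZE = 30000  # placeholder vocabulary size

def mask_tokens(ids, mlm_probability=0.15, seed=0):
    rng = random.Random(seed)
    inputs, labels = list(ids), []
    for i, tok in enumerate(ids):
        if rng.random() < mlm_probability:
            labels.append(tok)          # predict the original token here
            roll = rng.random()
            if roll < 0.8:
                inputs[i] = MASK_ID     # 80%: replace with [MASK]
            elif roll < 0.9:
                inputs[i] = rng.randrange(VOCAB_SIZE)  # 10%: random token
            # remaining 10%: keep the token unchanged
        else:
            labels.append(-100)         # ignored by the loss
    return inputs, labels

inputs, labels = mask_tokens(list(range(1000)))
print(sum(l != -100 for l in labels) / 1000)  # roughly 0.15
```

Raising mlm_probability to 0.2 simply selects more positions per batch for the model to predict, which is the tweak discussed in the hyperparameter section below.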
Finally, we define our training arguments and trainer inside our map function:
args = TrainingArguments(output_dir="/content/TPUCheckpoints",
                         do_train=True,
                         per_device_train_batch_size=32,
                         weight_decay=0.01,
                         num_train_epochs=3,
                         save_total_limit=2,
                         save_steps=500,
                         disable_tqdm=False,
                         remove_unused_columns=False)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_datasets["train"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)
trainer.train()
The remove_unused_columns=False parameter is necessary for Transformers version 4.4.0dev to avoid problems with lazy loading using with_transform. This will be fixed in the upcoming commits.
We will checkpoint the model after every 500 steps and use a batch size of 32 for each TPU core. An optional weight decay of 0.01 is also added. These parameters are freely tweakable.
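With per_device_train_batch_size=32 on eight TPU cores, the effective (global) batch size is 256, which in turn fixes the number of optimizer steps. A quick back-of-the-envelope calculation for our 500,000-sentence corpus (assuming no sentences are dropped):

```python
import math

sentences = 500_000
per_core_batch = 32
cores = 8
epochs = 3

global_batch = per_core_batch * cores           # examples consumed per step
steps_per_epoch = math.ceil(sentences / global_batch)
total_steps = steps_per_epoch * epochs
checkpoints = total_steps // 500                # save_steps=500

print(global_batch, steps_per_epoch, total_steps, checkpoints)
# -> 256 1954 5862 11
```

This is worth doing before training: save_total_limit=2 means only the last two of those checkpoints are kept on disk.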
Putting it all together, the complete training code looks like this:
import datasets
from transformers import (DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments, DistilBertConfig,
                          DistilBertForMaskedLM, DistilBertTokenizerFast)
import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as pl
import torch_xla.distributed.xla_multiprocessing as xmp

config = DistilBertConfig(vocab_size=30000)
model = xmp.MpModelWrapper(DistilBertForMaskedLM(config))
SERIAL_EXEC = xmp.MpSerialExecutor()

tokenizer = DistilBertTokenizerFast.from_pretrained("/content/Tokenizer", do_lower_case=False)

def get_tokenized_dataset():
    data_files = {"train": "/content/Train.txt"}
    tokenized_datasets = datasets.load_dataset('text', data_files=data_files)

    def tokenize_function(examples):
        # Remove empty lines
        examples["text"] = [line for line in examples["text"]
                            if len(line) > 0 and not line.isspace()]
        return tokenizer(
            examples["text"],
            padding="max_length",
            truncation=True,
            max_length=128,
            return_special_tokens_mask=True,
        )

    return tokenized_datasets.with_transform(tokenize_function)

def get_data_collator():
    return DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

def map_fn(index):
    device = xm.xla_device()
    model.to(device)
    xm.rendezvous("Model moved to device")

    # Defining arbitrary training arguments
    args = TrainingArguments(output_dir="/content/TPUCheckpoints",
                             do_train=True,
                             per_device_train_batch_size=32,
                             weight_decay=0.01,
                             num_train_epochs=3,
                             save_total_limit=2,
                             save_steps=500,
                             disable_tqdm=False,
                             remove_unused_columns=False,
                             ignore_data_skip=False)

    # Load the dataset serially, one process at a time
    tokenized_datasets = SERIAL_EXEC.run(get_tokenized_dataset)
    data_collator = get_data_collator()

    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=tokenized_datasets["train"],
        tokenizer=tokenizer,
        data_collator=data_collator,
    )
    trainer.train()

if __name__ == "__main__":
    xmp.spawn(map_fn, args=(), nprocs=8, start_method='fork')
Apart from MpModelWrapper, we also use an xmp.MpSerialExecutor, which runs a function serially, one process at a time; this keeps the eight processes from loading the dataset simultaneously. Though not strictly necessary, it has been included for reference.

Training Hyperparameters and Experiments

The choice of training hyperparameters is largely empirical, but a few configurations tend to work well. The model performs slightly better when the mlm_probability for the DataCollator is raised from its default value of 0.15 to 0.2. This change matters more for smaller datasets but is worth experimenting with.
Using a linear warmup of 500 to 1,000 steps generally helps the optimizer settle into the gradient landscape. Along with the warmup, a linear or polynomial learning-rate decay schedule is highly recommended.
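The warmup-then-decay schedule described above (the behaviour of transformers' get_linear_schedule_with_warmup) can be written down directly; here is a minimal sketch of the learning-rate multiplier, with an assumed base learning rate for illustration:

```python
def linear_warmup_linear_decay(step, warmup_steps, total_steps):
    """Multiplier applied to the base learning rate at a given step."""
    if step < warmup_steps:
        return step / warmup_steps                   # ramp up linearly
    # then decay linearly from 1.0 down to 0.0 at total_steps
    return max(0.0, (total_steps - step) / (total_steps - warmup_steps))

base_lr = 5e-5  # an assumed base learning rate, not a recommendation
for step in (0, 250, 500, 3000, 5862):
    print(step, base_lr * linear_warmup_linear_decay(step, 500, 5862))
```

The learning rate peaks exactly at the end of warmup and reaches zero at the final step, so warmup_steps and total_steps should be chosen together with the epoch count.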
The DistilBERT model, despite being an optimized version of BERT, still has a large number of tunable hyperparameters. If time and resources are limited, tweaking only the learning rate is often the best option. For TPU-based training, where the effective batch size is large, the learning rate should also be scaled up slightly (in our runs, by about 5%). Tweaking the learning rate alone is sufficient to get much better results, as displayed by the graph:
The final factor to touch upon is the usage of computational resources throughout the training.
As expected, the RAM is constantly used for on-the-fly tokenization with observable peaks during checkpoint saving. A similar trend is seen in the TPU utilization graph, where the usage only dips during checkpoints.


After the model is sufficiently trained, we can test it using a Hugging Face pipeline. Pipelines provide ready-made support for a range of tasks; since we are interested in masked language modelling, we will use the fill-mask pipeline to create a fill_mask() function.
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="/content/TrainingCheckpoints/",
    tokenizer="/content/TrainingCheckpoints/",
)
We pick a random test sentence and replace one of the words with a [MASK] token, which is the default masking token for the BertWordPieceTokenizer. If you are unsure about the mask token then you can find it using fill_mask.tokenizer.mask_token. Now all that is left to do is to simply call fill_mask with our masked input sentence and get our predictions:
print("Input sentence: ही कथा [MASK] आहे.")
print(fill_mask(f"ही कथा {fill_mask.tokenizer.mask_token} आहे."))
--------------------------------------------------
OUTPUT
Input sentence: ही कथा [MASK] आहे.
[{'score': 0.07070279866456985, 'sequence': 'ही कथा प्रसिद्ध आहे.', 'token': 2068, 'token_str': 'प्रसिद्ध'},
 {'score': 0.04407655820250511, 'sequence': 'ही कथा वेगळी आहे.', 'token': 3852, 'token_str': 'वेगळी'},
 {'score': 0.03667418286204338, 'sequence': 'ही कथा लिहिली आहे.', 'token': 8207, 'token_str': 'लिहिली'},
 {'score': 0.030504731461405754, 'sequence': 'ही कथा खरी आहे.', 'token': 4196, 'token_str': 'खरी'},
 {'score': 0.02702283300459385, 'sequence': 'ही कथा चुकीची आहे.', 'token': 10251, 'token_str': 'चुकीची'}]
To change the number of predictions displayed, the top_k parameter of the fill_mask function can be changed to the desired integer value.
A Colab notebook for running inference on our SoTA model can be found here.


Training a large model like DistilBERT or BERT for language modelling is often an expensive and time-consuming task. This report tackles some of the memory and storage optimization problems encountered while training a DistilBERT-based masked language model. It can be summarized in the following points:
  1. Comparisons between Hugging Face tokenizers on Marathi text.
  2. Optimization of PyTorch code for TPUs.
  3. Useful tips for hyperparameter tuning.
The use of W&B or Tensorboard for visualization is extremely helpful in situations where keeping track of training progress is essential.
I hope this report has given some insights on how to train a Devanagari language model for masked predictions with minimal memory and disk space requirements using Hugging Face transformers.
Extended thanks to Tylan Bilal for helping me out with the optimization of torch_xla code.

References
  1. HuggingFace (Thomas Wolf et al., 2020)
  2. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Jacob Devlin et al., 2018)
  3. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter (Victor Sanh et al., 2019)
  4. Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation (Yonghui Wu et al., 2016)
  5. Neural Machine Translation of Rare Words with Subword Units (Sennrich et al., 2015)