Task-Adaptive Pretraining
This report visualizes the continued pretraining of RoBERTa.
This report illustrates the adaptation of the RoBERTa encoder to the English texts from the BiPaR dataset. More precisely, RoBERTa was further pretrained on these texts with the masked language modeling (MLM) objective. Two masking strategies were used (a training sketch follows the list):
- roberta-large_40-tokens was trained by masking 40% of tokens.
- roberta-large_20-words was trained by masking 20% of words.
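For context, here is a minimal sketch of the token-level variant (roberta-large_40-tokens) using the Hugging Face transformers Trainer. The corpus file name, sequence length, and batch size are illustrative assumptions, not the exact settings behind these runs; only the masking rate and the 20-epoch budget come from this report.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("roberta-large")
model = AutoModelForMaskedLM.from_pretrained("roberta-large")

# Hypothetical text file with one English BiPaR passage per line.
dataset = load_dataset("text", data_files={"train": "bipar_passages_en.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# roberta-large_40-tokens: mask 40% of tokens for the MLM objective.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.4
)

args = TrainingArguments(
    output_dir="roberta-large_40-tokens",
    num_train_epochs=20,
    per_device_train_batch_size=8,  # assumed value for illustration
    report_to="wandb",              # log the training curves to W&B
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
```

The 20%-of-words variant (roberta-large_20-words) requires whole-word masking instead of independent token masking. The stock DataCollatorForWholeWordMask detects subwords via WordPiece-style "##" prefixes, so with RoBERTa's byte-level BPE tokenizer a custom collator that groups subword tokens back into words is typically needed.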
Training
- The perplexity and loss of the model trained with 40% of tokens masked were higher than those of the model trained with 20% of words masked (perplexity here is the exponential of the MLM loss; see the note after this list).
- Since RoBERTa had already been pretrained on stories, its perplexity changed only slightly during continued pretraining on the corpus generated from BiPaR passages.
- Over 20 training epochs, both models overfit only slightly to the BiPaR passages, as the loss curves show.
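The perplexity values follow directly from the MLM loss: perplexity is the exponential of the mean cross-entropy over the masked positions. A tiny sketch of the conversion (the loss value below is illustrative, not a result from these runs):

```python
import math

def mlm_perplexity(mean_masked_lm_loss: float) -> float:
    """Perplexity = exp(mean cross-entropy over masked positions)."""
    return math.exp(mean_masked_lm_loss)

# Illustrative value only, not a measurement from the roberta-large runs above.
print(mlm_perplexity(1.6))  # ~4.95
```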