How to Train Your HuggingFace Models Twice As Fast

This article summarizes 14 experiments and 5 reproducibility experiments on 2+1 optimizations (dynamic padding and uniform length batching, plus mixed precision) to reduce training time.
The purpose of this article is to explore two very simple optimizations which may significantly decrease training time for HuggingFace models on the Transformers library without a negative effect on accuracy.

Options To Reduce Training Time for Transformers

We ran 21 experiments and 12 reproducibility experiments on a large, well-known Natural Language Processing (NLP) dataset (the French part of X-NLI). We show that by simply using an out-of-the-box French BERT model (CamemBERT), default parameters, a single consumer-grade GPU, and these optimizations, the base flavor of the model reaches, at a 128 max token length, an accuracy of 81.5% in a 16 min training, beating by 0.5 points the score obtained with a 56 min training without any optimization, and beating by 0.3 points the score reported for this task by the CamemBERT model authors.
Gains are even more impressive on the same model at a 493 max token length, where training time decreases from 4h38 without any optimization to 1h01 with all optimizations, while still reaching the same score. Similar training time reductions have been reached with the large model (from 4h to 1h30 for a 128 token length).

Check Out Our Medium Article for More Details

We ran many experiments on the French part of X-NLI, which we logged to wandb:
  • base CamemBERT model: 14 experiments + 5 reproducibility experiments
  • large CamemBERT model: 7 experiments + 5 reproducibility experiments
In each case (base/large), experiments are separated into 2 groups:
  • a mini-batch of 64 sequences of max 128 tokens
  • a mini-batch of 2X8 sequences (2 gradient accumulation steps of 8 sequences) of max 493 tokens
As explained in the article, the 128-token setup causes truncation of 3% of the train set examples.
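To make this concrete, here is an illustrative snippet (not code from the gist) showing how a made-up premise/hypothesis pair would be encoded in the 128-token setup with the HuggingFace tokenizer; pairs longer than 128 tokens get truncated.

```python
from transformers import AutoTokenizer

# Illustrative only: the sentences below are invented, not taken from X-NLI.
tokenizer = AutoTokenizer.from_pretrained("camembert-base")
encoded = tokenizer(
    "Le chat dort sur le canapé.",         # premise (made-up example)
    "Un animal se repose à l'intérieur.",  # hypothesis (made-up example)
    truncation=True,
    max_length=128,
    padding="max_length",  # baseline behaviour; dynamic padding drops this fixed padding
)
print(len(encoded["input_ids"]))  # always 128 with fixed padding
```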
For each of these groups, we consider the combination of 3 optimizations to accelerate training:
  • mixed precision (Nvidia Apex compiled with gcc 7)
  • dynamic padding (sketched below)
  • smart batching (named uniform length batching in the article; because the experiments were logged with this stupid name, I will keep it for this report)
If you want to run those experiments by yourself, the source code is available in this GitHub gist.
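For intuition, here is a minimal sketch of dynamic padding, assuming already-tokenized examples with input_ids and label fields; the function and field names are illustrative and not taken from the gist.

```python
import torch

def pad_to_batch_longest(batch, pad_token_id=1):
    # Pad every example to the longest sequence *in this mini-batch* instead of a
    # fixed 128/493 max length, so short batches waste far less compute.
    # pad_token_id=1 matches CamemBERT's <pad> token; adapt it to your tokenizer.
    max_len = max(len(ex["input_ids"]) for ex in batch)
    input_ids, attention_mask, labels = [], [], []
    for ex in batch:
        n_pad = max_len - len(ex["input_ids"])
        input_ids.append(ex["input_ids"] + [pad_token_id] * n_pad)
        attention_mask.append([1] * len(ex["input_ids"]) + [0] * n_pad)
        labels.append(ex["label"])
    return {
        "input_ids": torch.tensor(input_ids),
        "attention_mask": torch.tensor(attention_mask),
        "labels": torch.tensor(labels),
    }

# Usage (illustrative): DataLoader(train_set, batch_size=64, collate_fn=pad_to_batch_longest)
```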

Training Time – Base Model – Batch of 2 Steps of 8 Sequences of 493 Tokens

In the first part of the report, we focus on the base flavor of the model.
When we don't apply any limit on the sequence length, the shortest training time is reached with the three options activated: mixed precision, dynamic padding, and smart batching, meaning that each of them is useful.
However, each option does not have the same effect: by far, dynamic padding has the most impact, which makes sense for a dataset where 97% of sequences are less than 128 tokens in length.
Smart batching helps this strategy and is the second most important option.
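A minimal sketch of smart batching (uniform length batching) could look like the following; it assumes already-tokenized examples and is not the exact implementation from the gist.

```python
import random

def uniform_length_batches(examples, batch_size):
    # Sort examples by token length so each mini-batch groups sequences of similar size;
    # combined with dynamic padding, almost no pad tokens are added.
    order = sorted(range(len(examples)), key=lambda i: len(examples[i]["input_ids"]))
    batches = [order[i:i + batch_size] for i in range(0, len(order), batch_size)]
    # Shuffle the *order of batches* (not the examples inside them) so training
    # does not see sequences in strictly increasing length.
    random.shuffle(batches)
    return [[examples[i] for i in idx] for idx in batches]
```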

[W&B chart: run set of 6 runs]


Training Time – Base Model – Batch of 1 Step of 64 Sequences of 128 Tokens

When we apply a 128 tokens length limit, the shortest training time is again reached with the 3 options activated: mixed precision, dynamic padding, and smart batching.
However, the impact of mixed precision is more important than before.
  • Mixed precision alone is 4% faster than dynamic padding and smart batching together (without mixed precision).
  • Mixed precision alone is 34% slower than mixed precision combined with dynamic padding and smart batching.
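As a reminder of what the mixed-precision option changes in the training loop, here is a minimal sketch using Nvidia Apex; model, optimizer, and train_dataloader are assumed to already exist, and the actual loop in the gist may differ.

```python
from apex import amp  # Nvidia Apex; newer PyTorch versions offer torch.cuda.amp instead

# Wrap an existing model/optimizer so most ops run in FP16 with an FP32 master copy.
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

for batch in train_dataloader:
    loss = model(**batch)[0]  # first output is the loss when labels are provided
    # Scale the loss to avoid FP16 gradient underflow, then backpropagate.
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```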

[W&B chart: run set of 6 runs]


Accuracy – Base Model – All Setups

We analyze all setups together because no simple trend appears, no matter how we group them: the speed optimizations have no obvious effect on accuracy.
First, we can notice that all scores are between 0.810 and 0.820, a very small gap for so many experiments with various patterns.
It is quite possible that if we trained models for more than 1 epoch we would notice a larger gap, but the XNLI (French part) score reported in the CamemBERT paper for 10 epochs is "only" 0.812.
The best score (0.820) is reached in 2 setups:
  • a batch of 64 sequences of 128 tokens with dynamic padding alone
  • a batch of 16 sequences of 493 tokens with dynamic padding alone
There is no good reason why dynamic padding alone would help accuracy; we think that it is just luck.
Another thing to check is the impact of smart batching alone compared to the setup without any option activated:
  • a batch of 16 sequences of 493 tokens, no option: 0.814
  • a batch of 16 sequences of 493 tokens with smart batching alone: 0.816
  • a batch of 64 sequences of 128 tokens, no option: 0.810
  • a batch of 64 sequences of 128 tokens with smart batching alone: 0.817
Smart batching seems to have a slightly positive impact on accuracy.

[W&B chart: run set of 14 runs]


Trade-Off Accuracy/Training Time – Batch of 8X2 Sequences of 493 Tokens


[W&B chart: run set "8X2 sequences 493 tokens", 7 runs]


Trade-Off Accuracy/Training Time – Batch of 64 Sequences of 128 Tokens



[W&B chart: run set of 7 runs]


Reproducibility Experiments – Base Model

We reran the fastest setting (dynamic padding + smart batching + mixed precision + 128 max token length) 5 times with different seeds.
Scores are stable (between 0.813 and 0.819), always above the one reported in the CamemBERT paper (0.812) for this flavor of the model.
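For reference, a generic way to vary the seed between reruns looks like this (an illustrative sketch, not necessarily the exact code used for these experiments):

```python
import random
import numpy as np
import torch

def set_seed(seed: int):
    # Seed every source of randomness that affects training (data shuffling,
    # dropout, initialization of the classification head, ...).
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

for seed in [1, 2, 3, 4, 5]:  # illustrative seed values
    set_seed(seed)
    # ... rebuild dataloaders/model and launch one training run per seed
```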


[W&B chart: run set of 6 runs]


Large Model Experiments

The second part of the report is dedicated to the large flavor of the model (335M parameters) instead of the base flavor (110M parameters).
In this setup, with the 12 GB of a 2080 Ti GPU, the maximum step size (per-step batch) is smaller than for the base model:
  • for a max 128 token length, the step size is 8 and we accumulate 2 steps to reach a batch of 16 examples
  • for a max 493 token length, the step size is 2 and we accumulate 8 steps to reach a batch of 16 examples
The 2 optimizations presented in the Medium article focus on batch/step generation. Because the model is 3X bigger and the GPU memory we are using for tests is limited to 12 GB, the step size is smaller and gradient accumulation is required (sketched below).
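As a reminder of how gradient accumulation keeps an effective batch of 16 despite the small per-step size, here is a self-contained sketch with toy stand-ins for the real model, optimizer, and data (in the real experiments these are CamemBERT, its optimizer, and the X-NLI DataLoader):

```python
import torch
from torch import nn

# Toy stand-ins so the pattern runs end to end.
model = nn.Linear(10, 3)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()
batches = [(torch.randn(2, 10), torch.randint(0, 3, (2,))) for _ in range(16)]

accumulation_steps = 8  # 8 steps of 2 examples ~ one effective batch of 16

optimizer.zero_grad()
for step, (x, y) in enumerate(batches):
    loss = loss_fn(model(x), y) / accumulation_steps  # scale so gradients average correctly
    loss.backward()  # gradients accumulate across the small steps
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```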
Without any optimization, training times are very long (15 hours for the 493 token length, 4 hours for 128 tokens).
With all optimizations, training times are reduced but still quite long (7 hours for the 493 token length, 1h30 for 128 tokens).
Whatever the setup (with or without optimization), the scores obtained here in 1 epoch are slightly lower than the one reported in the paper for 10 epochs + early stopping (our best optimized large model reached 0.856 and the paper reports 0.857).

Training Time – Large Model – Batch of 2 Steps of 8 Sequences of 128 Tokens


[W&B chart: run set of 3 runs]


Training Time – Large Model – Batch of 8 Steps of 2 Sequences of 493 Tokens


[W&B chart: run set of 3 runs]


Reproducibility Experiments – Large Model


[W&B chart: run set of 6 runs]


Next Steps

We hope you found this article useful. We encourage you to run these experiments by yourself; the source code is available in this GitHub gist.
You can follow me on Twitter @pommedeterre33.