The purpose of this report is to explore 2 very simple optimizations which may significantly decrease training time with the Transformers library, without any negative effect on accuracy.
We ran 21 experiments + 12 reproducibility experiments on a large, well-known NLP dataset (the French part of X-NLI), and we show that by simply using an out-of-the-box French BERT model (CamemBERT), default parameters, a single consumer-grade GPU, and these optimizations, the base flavor of the model can reach, for a 128 max token length, an accuracy of 81.5% in a 16-minute training, beating by 0.5 points the score obtained with a 56-minute training without any optimization, and beating by 0.3 points the score reported for this task by the CamemBERT authors.
Gains are even more impressive on the same model with a 493 max token length, where training time decreases from 4h38 without any optimization to 1h01 with all of them, while still reaching the same score. A similar training time reduction has been achieved with the large model (from 4h to 1h30 for a 128 max token length).
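To put these gains in perspective, the speed-ups follow directly from the timings quoted above (a quick sanity-check computation; only numbers already stated in this report are used):

```python
# Training times quoted above, converted to minutes: (no optimization, all optimizations).
timings = {
    "base, 128 tokens": (56, 16),
    "base, 493 tokens": (4 * 60 + 38, 61),   # 4h38 -> 1h01
    "large, 128 tokens": (4 * 60, 90),       # 4h -> 1h30
}

for setup, (before, after) in timings.items():
    print(f"{setup}: {before / after:.1f}x faster")
# base, 128 tokens: 3.5x faster
# base, 493 tokens: 4.6x faster
# large, 128 tokens: 2.7x faster
```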
We ran many experiments on the French part of X-NLI, all of which we logged:
- base CamemBERT model: 14 experiments + 5 reproducibility experiments
- large CamemBERT model: 7 experiments + 5 reproducibility experiments
In each case (base and large), experiments are separated into 2 groups: a 128 max token length setup and a setup with no sequence length limit.
As explained in the article, the 128 token setup causes truncation of 3% of the train set examples.
For each of these groups, we consider combinations of 3 optimizations to accelerate training: dynamic padding, smart batching, and mixed precision.
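As a rough illustration of what dynamic padding does, here is a minimal sketch in plain Python (the `pad_id` value and toy sequences are made up for illustration; a real implementation would live in a DataLoader collate function):

```python
def dynamic_pad(batch, pad_id=0):
    """Pad each sequence only up to the longest sequence in *this* batch,
    instead of a fixed global max length (e.g. 128 or 512)."""
    max_len = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in batch]

# Toy batch: with a fixed 128-token limit every row would carry 128 columns;
# with dynamic padding, this batch is only 5 tokens wide.
batch = [[101, 7, 8, 9, 102], [101, 7, 102], [101, 102]]
padded = dynamic_pad(batch)
```

Since most batches never contain a sequence near the global maximum, the model processes far fewer pad tokens per step.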
If you want to run these experiments yourself, the source code is available in this GitHub gist.
In the first part of the report, we focus on the base flavor of the model.
When we don't apply any limit on the sequence length, the shortest training time is reached with the 3 options activated: mixed precision, dynamic padding, and smart batching, meaning that each of them is useful.
However, the options do not all have the same effect: dynamic padding has by far the most impact, which makes sense for a dataset where 97% of sequences are fewer than 128 tokens long. Smart batching reinforces this strategy and is the second most important option.
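Smart batching (also called uniform-length batching) builds on the same idea: if sequences of similar lengths land in the same batches, dynamic padding wastes even fewer pad tokens. A minimal sketch with made-up toy data (a real version would reshuffle between epochs and feed these index batches to a batch sampler):

```python
import random

def smart_batches(sequences, batch_size, seed=42):
    """Sort sequences by length, cut the sorted order into batches of
    similar lengths, then shuffle the *order of the batches* (not the
    examples inside them) to keep some randomness between steps."""
    order = sorted(range(len(sequences)), key=lambda i: len(sequences[i]))
    batches = [order[i:i + batch_size] for i in range(0, len(order), batch_size)]
    random.Random(seed).shuffle(batches)
    return batches

# Toy example: 6 sequences of very different lengths, batch size 2.
seqs = [[1] * n for n in (3, 50, 4, 47, 5, 52)]
batches = smart_batches(seqs, batch_size=2)
# Short sequences end up together and long ones together, so each batch
# needs far less padding than a random grouping would.
```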
When we apply a 128 tokens length limit, the shortest training time is again reached with the 3 options activated: mixed precision, dynamic padding, and smart batching.
However, the impact of mixed precision is larger than before.
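For reference, this is the standard mixed-precision pattern in recent PyTorch (`torch.cuda.amp`); a hedged sketch with a toy linear model standing in for CamemBERT, not the exact training loop used in these experiments. fp16 autocast only kicks in on a GPU, so the flags below fall back to full precision on CPU:

```python
import torch

# Toy stand-in model and optimizer (assumptions for illustration only).
model = torch.nn.Linear(8, 2)
optimizer = torch.optim.Adam(model.parameters())

use_amp = torch.cuda.is_available()  # fp16 autocast requires a GPU
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

x, y = torch.randn(4, 8), torch.randint(0, 2, (4,))
with torch.cuda.amp.autocast(enabled=use_amp):
    # Forward pass runs in fp16 where safe, fp32 where needed.
    loss = torch.nn.functional.cross_entropy(model(x), y)

# Loss scaling avoids fp16 gradient underflow; it is a no-op when disabled.
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
```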
We analyze all setups together, as no simple trend appears no matter how we group them: speed optimizations have no obvious effect on accuracy.
First, we can notice that all scores fall between 0.810 and 0.820, a very narrow range for so many experiments with varied settings.
It is quite possible that training for more than 1 epoch would reveal a larger gap, but the X-NLI (French part) score reported in the CamemBERT paper for 10 epochs is "only" 0.812.
The best score (0.820) is reached in 2 setups:
There is no good reason why dynamic padding alone would help accuracy; we think it is just luck.
Another option to check is the impact of smart batching alone, compared to the setup without any option activated:
Smart batching seems to have a slightly positive impact on accuracy.
We reran the fastest setting (dynamic padding + smart batching + mixed precision + 128 max token length) 5 times with different seeds. Scores are stable (between 0.813 and 0.819) and always above the one reported in the CamemBERT paper (0.812) for this flavor of the model.
The second part of the report is dedicated to the large flavor of the model (335M parameters) instead of the base flavor (110M parameters).
The 2 optimizations presented in the Medium article focus on batch/step generation. Because the large model is 3X bigger than base while the GPU used for these tests is limited to 12 GB (a 2080 Ti), the maximum batch/step size is smaller than for the base model.
Without any optimization, training times are very long (15 hours for a 493 max token length, 4 hours for 128 tokens). With all optimizations, training times are reduced but still quite long (7 hours for 493 tokens, 1h30 for 128 tokens).
Whatever the setup (with or without optimizations), the scores obtained here in 1 epoch are slightly lower than those reported in the paper for 10 epochs + early stopping (the best optimized large model reaches 0.856, and the paper reports 0.857).