Options to reduce training time for Transformers

The purpose of this report is to explore 2 very simple optimizations which may significantly decrease training time with the Transformers library without any negative effect on accuracy.

We ran 21 experiments + 12 reproducibility experiments on a large, well-known NLP dataset (the French part of X-NLI), and we show that by simply using an out-of-the-box French BERT model (CamemBERT), default parameters, a single consumer-grade GPU, and these optimizations, the base flavor of the model reaches, for a 128 max token length, an accuracy of 81.5% in a 16-minute training, beating by 0.5 points the score obtained with a 56-minute training without any optimization, and by 0.3 points the score reported for this task by the CamemBERT authors.

Gains are even more impressive on the same model for a 493 max token length, where training time decreases from 4h38 without any optimization to 1h01 with all optimizations, while still reaching the same score. A similar training time reduction is obtained with the large model (from 4h to 1h30 for a 128 token length).

Check out our Medium article for more details.

We ran many experiments on the French part of X-NLI and logged them on wandb.

In each case (base/large), experiments are separated into 2 groups: one with a 128 max token length (with truncation), and one with no length limit (493 tokens, the length of the longest sequence).

As explained in the article, the 128 token setup causes truncation of 3% of the train set examples.
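
For readers who want to check this figure on their own data, below is a minimal sketch of how the truncation rate can be measured; the dataset identifier (`xnli`, config `fr`) and the `camembert-base` checkpoint are illustrative assumptions, not necessarily the exact ones used in the gist.

```python
# Sketch (not the gist's code): estimate how many train examples exceed 128 tokens.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("camembert-base")
train_set = load_dataset("xnli", "fr", split="train")

def count_tokens(batch):
    # Premise and hypothesis are encoded together as a single sequence pair.
    encodings = tokenizer(batch["premise"], batch["hypothesis"])
    return {"n_tokens": [len(ids) for ids in encodings["input_ids"]]}

lengths = train_set.map(count_tokens, batched=True)["n_tokens"]
truncated = sum(n > 128 for n in lengths)
print(f"{truncated / len(lengths):.1%} of train examples are longer than 128 tokens")
```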

For each of these groups, we consider the combination of 3 optimizations to accelerate training: dynamic padding, smart batching, and mixed precision.
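
The experiments rely on the custom training loop from the gist, but roughly equivalent options exist in the high-level Hugging Face Trainer API; the sketch below illustrates how the 3 optimizations can be enabled together and is not the code actually benchmarked in this report.

```python
# Sketch (not the gist's code): the 3 optimizations expressed as Trainer options.
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("camembert-base")
model = AutoModelForSequenceClassification.from_pretrained("camembert-base", num_labels=3)

def tokenize(batch):
    # Tokenize without padding: padding is applied per batch by the data collator.
    return tokenizer(batch["premise"], batch["hypothesis"], truncation=True, max_length=128)

train_set = load_dataset("xnli", "fr", split="train").map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="camembert-xnli",
    num_train_epochs=1,
    per_device_train_batch_size=64,  # 1 step of 64 sequences of 128 tokens, as in the base model setup
    fp16=True,                       # mixed precision
    group_by_length=True,            # smart batching: group similar-length sequences
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_set,
    data_collator=DataCollatorWithPadding(tokenizer),  # dynamic padding per batch
)
trainer.train()
```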

If you want to run those experiments yourself, the source code is available on this Github gist.

Training time - base model - batch of 2 steps of 8 sequences of 493 tokens

In the first part of the report, we focus on the base flavor of the model.

When we don't apply any limit on the sequence length, the shortest training time is reached with the 3 options activated: mixed precision, dynamic padding, and smart batching, meaning that each of them is useful.

However, each option does not have the same effect: by far, dynamic padding has the most impact, which makes sense for a dataset where 97% of sequences are shorter than 128 tokens. Smart batching helps this strategy and is the second most important option.
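
For illustration (this is not the gist's exact code), dynamic padding boils down to a collate function that pads each batch only to its own longest sequence instead of the 493-token maximum:

```python
# Sketch: dynamic padding — pad each batch only to its own longest sequence.
import torch

def dynamic_padding_collate(batch, pad_token_id):
    # batch: list of dicts with unpadded "input_ids" (list of ints) and "label".
    max_len = max(len(example["input_ids"]) for example in batch)
    input_ids, attention_mask, labels = [], [], []
    for example in batch:
        pad_len = max_len - len(example["input_ids"])
        input_ids.append(example["input_ids"] + [pad_token_id] * pad_len)
        attention_mask.append([1] * len(example["input_ids"]) + [0] * pad_len)
        labels.append(example["label"])
    return {
        "input_ids": torch.tensor(input_ids),
        "attention_mask": torch.tensor(attention_mask),
        "labels": torch.tensor(labels),
    }
```

Passed to a `DataLoader` through `collate_fn` (with `pad_token_id=tokenizer.pad_token_id`), this alone removes most of the computation wasted on padding tokens.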

Training time - base model - batch of 1 step of 64 sequences of 128 tokens

When we apply a 128 tokens length limit, the shortest training time is again reached with the 3 options activated: mixed precision, dynamic padding, and smart batching.

However, the impact of mixed precision is more important than before.
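
As a reminder of what the mixed precision option does, here is a minimal training step using PyTorch's native `torch.cuda.amp`; the experiments may use a different mixed precision implementation, so treat this as a sketch of the technique rather than the benchmarked code.

```python
# Sketch: one mixed-precision training step with torch.cuda.amp.
import torch
from torch.cuda.amp import GradScaler, autocast
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "camembert-base", num_labels=3
).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
scaler = GradScaler()

def training_step(batch):
    # batch: tensors such as those produced by the dynamic padding collate function above.
    optimizer.zero_grad()
    with autocast():  # run the forward pass in float16 where it is safe
        outputs = model(**{k: v.cuda() for k, v in batch.items()})
    scaler.scale(outputs.loss).backward()  # scale the loss to avoid float16 underflow
    scaler.step(optimizer)                 # unscale gradients, then take the optimizer step
    scaler.update()
    return outputs.loss.item()
```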

Full code in Github gist →

Accuracy - base model - all setups

We analyze all setups together as no simple trend appears, no matter how we group the experiments. This means that the speed optimizations have no obvious effect on accuracy.

First, we can notice that all scores are between 0.810 and 0.820, a very narrow range for so many experiments with various configurations.

It is quite possible that training for more than 1 epoch would reveal a larger gap, but the X-NLI (French part) score reported in the CamemBERT paper for 10 epochs is "only" 0.812.

The best score (0.820) is reached in 2 setups:

There is no good reason why dynamic padding alone would help accuracy; we think it is just luck.

Another point to check is the impact of smart batching alone compared to the setup without any option activated:

Smart batching seems to have a slightly positive impact on accuracy.
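
For reference, smart batching simply means building batches from length-sorted examples so that, combined with dynamic padding, each batch contains almost no padding; here is a minimal sketch (an illustration, not the gist's implementation):

```python
# Sketch: smart batching — build batches from length-sorted examples so that,
# together with dynamic padding, each batch contains almost no padding.
import random

def smart_batches(examples, batch_size):
    # examples: list of dicts with an unpadded "input_ids" field.
    order = sorted(range(len(examples)), key=lambda i: len(examples[i]["input_ids"]))
    batches = [order[i:i + batch_size] for i in range(0, len(order), batch_size)]
    random.shuffle(batches)  # shuffle batch order (not examples) to keep lengths uniform within a batch
    return [[examples[i] for i in batch] for batch in batches]
```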

Full code in Github gist →

Trade-off accuracy / training time - batch of 2 steps of 8 sequences of 493 tokens

Trade-off accuracy / training time - batch of 1 step of 64 sequences of 128 tokens

Reproducibility experiments - base model

We reran the fastest setup (dynamic padding + smart batching + mixed precision + 128 max token length) 5 times with different seeds. Scores are stable (between 0.813 and 0.819) and always above the one reported in the CamemBERT paper (0.812) for this flavor of the model.
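
The seeding code itself is not detailed in this report; a typical way to vary the seed per run would look like the sketch below (the seed values are hypothetical, and this is not the gist's exact code).

```python
# Sketch: one run per seed for the reproducibility experiments.
import random
import numpy as np
import torch

def set_all_seeds(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

for seed in (1, 2, 3, 4, 5):  # hypothetical seed values
    set_all_seeds(seed)
    # ... rebuild the model and dataloaders, then train for 1 epoch ...
```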

Large model experiments

The second part of the report is dedicated to the large flavor of the model (335M parameters) instead of the base flavor (110M parameters).

In this setup, on the 12 GB of a 2080 Ti GPU, the maximum step size is smaller than for the base model: 8 sequences of 128 tokens, or 2 sequences of 493 tokens.

The 2 optimizations presented in the Medium article focus on batch/step generation. Because the model is 3X bigger and the GPU used for these tests is limited to 12 GB of memory, the step size is smaller.
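
The smaller step size is compensated with gradient accumulation, so that one optimizer update still covers a full batch; below is a minimal sketch of such a loop (an assumption about the implementation, not the gist's exact code).

```python
# Sketch: gradient accumulation — e.g. 2 steps of 8 sequences per optimizer update.
def train_one_epoch(model, optimizer, train_dataloader, accumulation_steps=2):
    model.train()
    optimizer.zero_grad()
    for step, batch in enumerate(train_dataloader):
        outputs = model(**{k: v.cuda() for k, v in batch.items()})
        loss = outputs.loss / accumulation_steps  # keep the gradient magnitude comparable
        loss.backward()
        if (step + 1) % accumulation_steps == 0:  # update only every N accumulated steps
            optimizer.step()
            optimizer.zero_grad()
```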

Without any optimization, training times are very long (15 hours for the 493 token length, 4 hours for 128 tokens). With all optimizations, training times are reduced but still quite long (7 hours for the 493 token length, 1h30 for 128 tokens).

Whatever the setup (with or without optimizations), the scores obtained here in 1 epoch are slightly lower than the one reported in the paper for 10 epochs + early stopping (our best optimized large model reaches 0.856, while the paper reports 0.857).

Training time - large model - batch of 2 steps of 8 sequences of 128 tokens

Training time - large model - batch of 4 steps of 2 sequences of 493 tokens

Reproducibility experiments - large model

Next Steps

We hope you found this report useful and encourage you to run these experiments yourself; the source code is available on this Github gist.

You can follow me on Twitter @pommedeterre33.