Train HuggingFace models twice as fast


Options to reduce training time for Transformers

The purpose of this report is to explore 2 very simple optimizations which may significantly decrease training time on Transformers library without negative effect on accuracy.

We ran 21 experiments + 12 reproducibility experiments on a large well-known NLP dataset (French part of X-NLI), and we show that by simply using an out-of-the-box French BERT model (CamemBERT), default parameters, a single consumer grade GPU, and these optimizations, for base flavor of the model, we can reach, for 128 max token length, in a 16 min training an accuracy of 81.5%, beating by 0.5 points the score obtained with a 56 min training without any optimization, and beating by 0.3 points the score reported for this task by the CamemBERT model authors.

Gains are even more impressive on the same model, for 493 max token length, where training time decreases from 4h38 without any optimization to 1h01 with all optimizations, still reaching the same score. Similar training time reduction have been reached with large model (from 4h to 1h30 for 128 tokens length).

Read the full post →

Join our mailing list to get the latest machine learning updates.