Reports
Adafactor learning-rate 0.005 seems best for t5-base training
TL;DR: In a comparison between Adafactor, AdamW and Distributed Shampoo for training t5-base, Adafactor without gradient accumulation seems to converge fastest. Because the optimizer is logged differently in the two scripts I am using, I cannot change the legend to show the optimizer without displaying false information. All the runs that converge below a loss of 4 after ~2 hours are Adafactor runs without gradient accumulation. Peach-sweep-6 is the best of these, with the learning rate set to 5e-3.
2022-02-19
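As a rough illustration (not the actual training script), the winning configuration corresponds to something like the following optax setup: Adafactor at learning rate 5e-3 with no gradient-accumulation wrapper. The optax calls are my own sketch, and the accumulation factor shown for contrast is hypothetical.

```python
import optax

# Sketch of the best-performing setup from the sweep: Adafactor with
# learning rate 5e-3 and no gradient accumulation. Illustrative only;
# this is not taken from the training scripts themselves.
optimizer = optax.adafactor(learning_rate=5e-3)

# For comparison, gradient accumulation could be layered on with
# optax.MultiSteps; the best runs above omit it.
accumulating_optimizer = optax.MultiSteps(
    optax.adafactor(learning_rate=5e-3),
    every_k_schedule=8,  # hypothetical accumulation factor
)
```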
pjit script results comparable with pmap script
Comparing a t5-base training run with Adafactor across two different scripts. Notes:
* Green: pjit script, blue: pmap script
* The pjit script does not average the training loss between logging steps, so its curve appears more jagged
* The pmap script is slightly faster, possibly due to a suboptimal model partitioning definition
2022-02-19
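For context, here is a toy sketch of the two parallelization styles the scripts use, written against the current jax.sharding API (the 2022-era scripts most likely used jax.experimental.pjit, but the idea is the same). The loss function, shapes, and partition specs are placeholders, not the real t5-base step or partitioning definition.

```python
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec

# Toy stand-in for a train-step loss; not the actual t5-base step.
def loss_fn(params, batch):
    return jnp.mean((batch * params) ** 2)

params = jnp.array(2.0)
batch = jnp.ones((jax.device_count(), 8))

# pmap style: one replica per device, batch pre-split along axis 0.
pmap_loss = jax.pmap(loss_fn, in_axes=(None, 0))
print(pmap_loss(params, batch))  # one loss value per device

# pjit style (recent JAX folds pjit into jit): the compiler shards the
# inputs over a named device mesh according to the PartitionSpecs.
mesh = Mesh(np.array(jax.devices()), axis_names=("data",))
sharded_loss = jax.jit(
    loss_fn,
    in_shardings=(
        NamedSharding(mesh, PartitionSpec()),              # replicate params
        NamedSharding(mesh, PartitionSpec("data", None)),  # shard batch rows
    ),
)
print(sharded_loss(params, batch))
```

The PartitionSpecs play the role of the "model partitioning definition" mentioned in the notes above; a suboptimal choice there could account for the pjit script being slightly slower.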