Reports
Adafactor learning-rate 0.005 seems best for t5-base training
TL;DR: In a comparison between Adafactor, AdamW and Distributed Shampoo for training t5-base, Adafactor without gradient accumulation seems to converge fastest. Because the optimizer is logged differently in the two scripts I am using, I cannot change the legend to show the optimizer without displaying false information. All the runs that converge below a loss of 4 after ~2 hours are Adafactor runs without gradient accumulation. Peach-sweep-6 is the best of these, with the learning rate set to 5e-3.
2022-02-19
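As a rough illustration (not the actual training script), the winning configuration corresponds to something like the following optax setup: Adafactor at learning rate 5e-3 with no gradient-accumulation wrapper. The optax calls are my own sketch, and the accumulation factor shown for contrast is hypothetical.

```python
import optax

# Sketch of the best-performing setup from the sweep: Adafactor with
# learning rate 5e-3 and no gradient accumulation. Illustrative only;
# this is not taken from the training scripts themselves.
optimizer = optax.adafactor(learning_rate=5e-3)

# For comparison, gradient accumulation could be layered on with
# optax.MultiSteps; the best runs above omit it.
accumulating_optimizer = optax.MultiSteps(
    optax.adafactor(learning_rate=5e-3),
    every_k_schedule=8,  # hypothetical accumulation factor
)
```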
pjit script results comparable with pmap script
Comparing a t5-base training run with Adafactor across two different scripts. Notes:
* Green: pjit script, blue: pmap script
* The pjit script does not average the training loss between logging steps, so its curve appears more jagged
* The pmap script is slightly faster, possibly due to a suboptimal model partitioning definition
2022-02-19
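For context, here is a toy sketch of the two parallelization styles the scripts use, written against the current jax.sharding API (the 2022-era scripts most likely used jax.experimental.pjit, but the idea is the same). The loss function, shapes, and partition specs are placeholders, not the real t5-base step or partitioning definition.

```python
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec

# Toy stand-in for a train-step loss; not the actual t5-base step.
def loss_fn(params, batch):
    return jnp.mean((batch * params) ** 2)

params = jnp.array(2.0)
batch = jnp.ones((jax.device_count(), 8))

# pmap style: one replica per device, batch pre-split along axis 0.
pmap_loss = jax.pmap(loss_fn, in_axes=(None, 0))
print(pmap_loss(params, batch))  # one loss value per device

# pjit style (recent JAX folds pjit into jit): the compiler shards the
# inputs over a named device mesh according to the PartitionSpecs.
mesh = Mesh(np.array(jax.devices()), axis_names=("data",))
sharded_loss = jax.jit(
    loss_fn,
    in_shardings=(
        NamedSharding(mesh, PartitionSpec()),              # replicate params
        NamedSharding(mesh, PartitionSpec("data", None)),  # shard batch rows
    ),
)
print(sharded_loss(params, batch))
```

The PartitionSpecs play the role of the "model partitioning definition" mentioned in the notes above; a suboptimal choice there could account for the pjit script being slightly slower.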