Sweep LR for 150m -> best LR = 3e-3
Sweep LR for 150m MuP -> best LR = 1e-3
Note that the losses here are much higher than in the SP sweep above.
SP vs. MuP vs. heuristic SP on 1.4b (scaling factor = 2048/512=4)
Heuristic SP uses 0.25x (1 / scaling factor) the best LR from the sweep before.
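A minimal sketch of the heuristic's arithmetic, assuming that 2048 and 512 are the widths of the 1.4b and 150m models and that "the sweep before" means the 150m SP sweep (best LR = 3e-3). The function name `heuristic_lr` is hypothetical and only for illustration.

```python
def heuristic_lr(base_lr: float, base_width: int, target_width: int) -> float:
    """Divide the small-model LR by the scaling factor (target / base)."""
    scaling_factor = target_width / base_width  # 2048 / 512 = 4
    return base_lr / scaling_factor             # same as 0.25x for a factor of 4


# Assumed inputs: best SP LR from the 150m sweep, widths 512 -> 2048.
lr_1p4b = heuristic_lr(base_lr=3e-3, base_width=512, target_width=2048)
print(f"{lr_1p4b:.2e}")  # 7.50e-04
```

Under these assumptions the 1.4b heuristic SP run would use an LR of 7.5e-4.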
MuP gets pretty much the same loss, but with some loss spikes during training.
This means our heuristic works fine, and it's probably not worth adding the extra complexity of MuP to our system for now.