Sweat the small network. 512 hidden size and 5 layers.
avg_loss
avg_loss
Select runs that logged avg_loss to visualize data in this line chart.
Run set
0
Let's just temporarily make plast_clip be a lower-bound clip. I noticed some interesting dynamics with it. For reference, lr and plr are fixed at 1e-3, and the above graph has the lower-bound clip on plasticity at 1e-7.
Run set
0
big version of that first set, with lower-bound set back to 1e-7, but 15 layers and hidden size of 1024.