Comparing learning rates for the main weights (lr) and for the plasticity coefficients (plr).
[Line chart: loss]
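For reference, here's a minimal sketch of the kind of setup being compared, assuming a differentiable-plasticity layer in the style of Miconi et al.; the layer, the names (`w`, `alpha`, `hebb`), and the trace update are hypothetical, not the exact code behind these runs. The point is the two optimizer parameter groups: the main weights train at lr, the plasticity at plr.

```python
import torch
import torch.nn as nn

class PlasticLinear(nn.Module):
    """Linear layer whose effective weight is w + alpha * hebb,
    where alpha (the plasticity) gates a running Hebbian trace."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.w = nn.Parameter(0.01 * torch.randn(out_features, in_features))
        self.alpha = nn.Parameter(0.01 * torch.randn(out_features, in_features))
        self.register_buffer("hebb", torch.zeros(out_features, in_features))

    def forward(self, x):
        y = torch.tanh(x @ (self.w + self.alpha * self.hebb).t())
        with torch.no_grad():  # update the trace without tracking gradients
            self.hebb = 0.9 * self.hebb + 0.1 * (y.t() @ x) / x.shape[0]
        return y

model = PlasticLinear(128, 256)

# Two parameter groups, so lr and plr can be swept independently.
optimizer = torch.optim.Adam([
    {"params": [model.w], "lr": 1e-5},      # lr (main weights)
    {"params": [model.alpha], "lr": 1e-7},  # plr (plasticity)
])
```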
The same comparison, but with a hidden size of 256 instead of 128.
Let's really narrow down what the best learning-rate pairing is. It looks like it's at least an lr of 1e-5 or 5e-5, paired with a plr of 1e-7.
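A sketch of how a grid over such pairings can be run and logged, assuming a hypothetical `train_one_run` helper that builds the model, trains it, and logs `loss` to the active W&B run; the value lists are illustrative, not the exact grids above.

```python
import itertools
import wandb

# Hypothetical grids around the region that looked best above.
lrs = [1e-4, 5e-5, 1e-5, 5e-6]   # main-weight learning rates
plrs = [1e-6, 1e-7, 1e-8]        # plasticity learning rates

for lr, plr in itertools.product(lrs, plrs):
    run = wandb.init(project="plasticity", config={"lr": lr, "plr": plr},
                     reinit=True)
    train_one_run(lr=lr, plr=plr)  # assumed helper: trains and logs `loss`
    run.finish()
```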
Now let's explore some finer tuning, as well as deeper networks. We want the model to learn fast, but not diverge or explode later. This run set uses a hidden size of 256.
Hidden size 512.
Some exploratory runs: experimenting with smaller batch sizes.
Some rudimentary experimentation with code changes.
More disciplined experiments: I try both a smaller and a larger clip for the plasticity values, to see whether either one improves anything, such as stability.
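A minimal sketch of what clipping the plasticity values means here, reusing the hypothetical `alpha` parameter from the sketch above; the clamp bounds and step structure are assumptions, not the exact training loop.

```python
import torch

def train_step(model, optimizer, x, target, loss_fn, clip=0.01):
    """One step; `clip` is the hypothetical bound being varied across runs."""
    optimizer.zero_grad()
    loss = loss_fn(model(x), target)
    loss.backward()
    optimizer.step()
    # Clamp the plasticity coefficients in place after the update;
    # the runs compare a smaller and a larger `clip` for stability.
    with torch.no_grad():
        model.alpha.clamp_(-clip, clip)
    return loss.item()
```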