Static plasticity, lr 1e-3, 5 layers, 1024 parameters in each layer. The main difference is the dataset: this one uses long_range_memory_dataset, which is mostly zeros but contains a pattern where the model has to store a value and access it later.
The goal is to see whether this larger model manages to correctly store and retrieve information that was only presented once.
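For a rough picture of the task, here is a hypothetical sketch of a long_range_memory-style batch generator. The exact format of the real long_range_memory_dataset (channels, sequence length, where the value and the recall cue land) is assumed here, not taken from its code.

```python
import numpy as np

def long_range_memory_batch(batch_size=32, seq_len=256, rng=None):
    """Hypothetical sketch: sequences are mostly zeros, a value is written
    exactly once early on, and the target asks the model to reproduce that
    value at a later cue position."""
    rng = rng or np.random.default_rng()
    x = np.zeros((batch_size, seq_len, 2), dtype=np.float32)  # channels: [value, cue]
    y = np.zeros((batch_size, seq_len), dtype=np.float32)
    for b in range(batch_size):
        store_t = rng.integers(0, seq_len // 4)              # value appears early
        recall_t = rng.integers(3 * seq_len // 4, seq_len)   # cue appears late
        value = rng.uniform(-1.0, 1.0)
        x[b, store_t, 0] = value   # the value, shown only once
        x[b, recall_t, 1] = 1.0    # cue flag: "output the stored value now"
        y[b, recall_t] = value     # target is zero everywhere else
    return x, y
```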
[Line chart: avg_loss]
Same dimensions, different dataset, different lr. This one uses the baby names dataset. I'd like to see whether the slower-learning parameters manage to preserve important information while still letting the model adapt to the current context of the name being predicted. This model ought to do better than a baseline with uniform plasticity values. Some runs may have a wider range of possible plasticity values; I forget. "worldly sea" is the uniform baseline, as is "morning-star".
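For reference, this is roughly what per-parameter plasticity means in these runs, as a minimal sketch rather than the actual training code: each weight gets its own fixed multiplier on its update, and the uniform baseline just sets every multiplier to the same value. The init range here is an assumption, since the range used in the runs isn't stated.

```python
import torch

def init_plasticity(params, uniform=True, low=0.1, high=2.0):
    """Uniform baseline ("worldly sea" / "morning-star" style): every parameter
    gets plasticity 1.0. Per-parameter variant: each weight gets its own fixed
    multiplier drawn from an assumed range [low, high]."""
    if uniform:
        return [torch.ones_like(p) for p in params]
    return [torch.empty_like(p).uniform_(low, high) for p in params]

@torch.no_grad()
def plastic_sgd_step(params, plasticities, base_lr=1e-3):
    """One SGD step where each weight's update is scaled by its own
    (static) plasticity value."""
    for p, plast in zip(params, plasticities):
        if p.grad is not None:
            p -= base_lr * plast * p.grad  # effective lr = base_lr * plasticity
```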
[Run set panel]
Let's try adding in an EMA of the weights...
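A minimal sketch of what "EMA of the weights" would look like here, assuming a standard shadow-copy exponential moving average; the decay value is a placeholder, the report doesn't state one.

```python
import copy
import torch

class WeightEMA:
    """Keeps a shadow copy of the model whose weights track an exponential
    moving average of the live weights."""
    def __init__(self, model, decay=0.999):
        self.decay = decay
        self.shadow = copy.deepcopy(model).eval()
        for p in self.shadow.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model):
        # shadow <- decay * shadow + (1 - decay) * current weights
        for ema_p, p in zip(self.shadow.parameters(), model.parameters()):
            ema_p.mul_(self.decay).add_(p, alpha=1.0 - self.decay)
```

Typical usage would be calling ema.update(model) after each optimizer step and evaluating with ema.shadow.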
[Run set panel: 4701]
OK, now that we see some results, let's let the model change some of that plasticity (no EMA), so we're just varying the plasticity learning rate here.
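Concretely, "changing the plasticity learning rate" can be read as making the plasticity tensors trainable and giving them their own optimizer group with a separate lr. This is a hedged sketch of that setup; how gradients actually reach the plasticity values depends on the model's own (differentiable) update rule, which isn't shown here.

```python
import torch

def make_plastic_optimizer(weight_params, weight_lr=1e-3, plasticity_lr=1e-4):
    """Two-group optimizer: weights keep their usual lr, while the trainable
    plasticity tensors get their own, separate learning rate (the knob being
    swept across these runs)."""
    plasticity_params = [torch.nn.Parameter(torch.ones_like(p)) for p in weight_params]
    opt = torch.optim.Adam([
        {"params": weight_params, "lr": weight_lr},
        {"params": plasticity_params, "lr": plasticity_lr},
    ])
    return opt, plasticity_params
```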