
Learnable Plasticity

Created on June 21 | Last edited on June 22
The exact same learning process (minus the stochasticity, I guess) is applied to a new value called 'plasticity', which is unique to each weight and scales its update just like the learning rate does. So it's as if each weight has its own personal learning rate.
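
Roughly, a single update step would look like this (a minimal PyTorch-style sketch; the values and the `grad_plasticity` stand-in are illustrative, not the actual code from these runs):

```python
import torch

base_lr = 1e-2        # illustrative values, not the ones from these runs
plasticity_lr = 1e-3

w = torch.randn(128, 128)
plasticity = torch.ones_like(w)   # one plasticity value per weight, starting at 1

# grad_w comes from ordinary backprop on the training loss; grad_plasticity is a
# stand-in for whatever gradient the plasticity update actually uses in the runs.
grad_w = torch.randn_like(w)
grad_plasticity = torch.randn_like(w)

# Plasticity scales its weight's step, i.e. it acts as a per-weight learning rate.
w = w - base_lr * plasticity * grad_w

# Plasticity itself gets the same kind of gradient step, with its own learning rate.
plasticity = plasticity - plasticity_lr * grad_plasticity
```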

Here, the lr is set to 1e5 and the plasticity learning rate to 1e6. I guess it makes sense that the 512 hidden size explodes; I think it did before at that lr. I also don't clip the plasticity, so it looks volatile. It shouldn't really get very large, and it should never be negative.

[Chart: loss vs. step, log-scale y-axis. Runs (update rule, hidden size, layers):
plastic_candidate 4096 3
static_plastic_candidate 2048 16
static_plastic_candidate 2048 8
plastic_candidate 2048 5
static_plastic_candidate 2048 3
plastic_candidate 2048 3
candidate 1028 3
plastic_candidate 1024 15
static_plastic_candidate 1024 10]

Run set: 6032 runs

At a hidden size of 128 and 3 layers, I experiment with different plasticity clip values; the lower bound is fixed at 0.1. I'd like to know how the clip influences the loss descent, and whether it can prevent the explosion.
I use a clip value of 0 to label the baseline on this graph, but it isn't actually applied in that case, since I also switch the update rule to candidate. I just want a no-plasticity benchmark to compare against.
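
In code, the clip amounts to something like this (sketch; `clip_plasticity` is an illustrative helper, not the actual implementation, and `clip_value` is the swept setting):

```python
import torch

def clip_plasticity(plasticity: torch.Tensor, clip_value: float) -> torch.Tensor:
    """Clamp plasticity into [0.1, clip_value] after its update step.

    clip_value == 0 is only a label for the no-plasticity baseline (the plain
    'candidate' update rule); in that case plasticity isn't updated or clipped.
    """
    if clip_value == 0:
        return plasticity
    return plasticity.clamp(min=0.1, max=clip_value)
```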

Run set: 24 runs

Let's assume that 2 is the best clip for the plasticity levels. So a given weight will never be more than twice as plastic as it was at initialization.
How does the balance between the learning rate and the plasticity learning rate work, especially now that we have a clip?
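
One way to frame it (sketch with illustrative numbers, not the values from the runs): the clip bounds the effective per-weight learning rate, so the plasticity learning rate only decides how fast each weight drifts around inside that band.

```python
base_lr = 1e-3                                # illustrative
plasticity_low, plasticity_high = 0.1, 2.0    # the clip band

# With the clip in place, the effective per-weight learning rate is bounded:
effective_lr_min = base_lr * plasticity_low    # 1e-4
effective_lr_max = base_lr * plasticity_high   # 2e-3
print(effective_lr_min, effective_lr_max)

# The plasticity lr then only controls how fast a weight's multiplier moves
# inside that [0.1, 2] band, not how far the effective lr can ever get.
```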

Run set: 33 runs