
Learnable Plasticity

Created on June 21 | Last edited on June 22
The exact same learning process (minus the stochasticity, I guess) is applied to a new value called 'plasticity', which is unique to each weight and scales its update just like the learning rate does. So it's as if each weight has its own personal learning rate.
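
Roughly, a single update step would look like this (a minimal PyTorch-style sketch; the values and the `grad_plasticity` stand-in are illustrative, not the actual code from these runs):

```python
import torch

base_lr = 1e-2        # illustrative values, not the ones from these runs
plasticity_lr = 1e-3

w = torch.randn(128, 128)
plasticity = torch.ones_like(w)   # one plasticity value per weight, starting at 1

# grad_w comes from ordinary backprop on the training loss; grad_plasticity is a
# stand-in for whatever gradient the plasticity update actually uses in the runs.
grad_w = torch.randn_like(w)
grad_plasticity = torch.randn_like(w)

# Plasticity scales its weight's step, i.e. it acts as a per-weight learning rate.
w = w - base_lr * plasticity * grad_w

# Plasticity itself gets the same kind of gradient step, with its own learning rate.
plasticity = plasticity - plasticity_lr * grad_plasticity
```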

Here, the lr is set to 1e5 and the plasticity learning rate to 1e6. I guess it makes sense that the 512 hidden size explodes; I think it did before at that lr. I also don't clip the plasticity, so it looks volatile. It shouldn't really get very large, and it should never be negative.

[Chart: loss vs. step, log-scale y-axis. Runs (update rule, hidden size, layers):
plastic_candidate 4096 3
static_plastic_candidate 2048 16
static_plastic_candidate 2048 8
plastic_candidate 2048 5
static_plastic_candidate 2048 3
plastic_candidate 2048 3
candidate 1028 3
plastic_candidate 1024 15
static_plastic_candidate 1024 10]

Run set: 6032 runs

At a hidden size of 128 and 3 layers, I experiment with different plasticity clip values; the lower bound is fixed at 0.1. I'd like to know how the clip influences the loss descent, and whether it can prevent the explosion.
I use a clip value of 0 to label the baseline on this graph, but it isn't actually applied in that case, since I also switch the update rule to candidate. I just want a no-plasticity benchmark to compare against.
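
In code, the clip amounts to something like this (sketch; `clip_plasticity` is an illustrative helper, not the actual implementation, and `clip_value` is the swept setting):

```python
import torch

def clip_plasticity(plasticity: torch.Tensor, clip_value: float) -> torch.Tensor:
    """Clamp plasticity into [0.1, clip_value] after its update step.

    clip_value == 0 is only a label for the no-plasticity baseline (the plain
    'candidate' update rule); in that case plasticity isn't updated or clipped.
    """
    if clip_value == 0:
        return plasticity
    return plasticity.clamp(min=0.1, max=clip_value)
```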

Run set: 24 runs

Let's assume that 2 is the best clip for the plasticity levels. So a given weight will never be more than twice as plastic as it was at initialization.
How does the balance between the learning rate and the plasticity learning rate work, especially now that we have a clip?
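
One way to frame it (sketch with illustrative numbers, not the values from the runs): the clip bounds the effective per-weight learning rate, so the plasticity learning rate only decides how fast each weight drifts around inside that band.

```python
base_lr = 1e-3                                # illustrative
plasticity_low, plasticity_high = 0.1, 2.0    # the clip band

# With the clip in place, the effective per-weight learning rate is bounded:
effective_lr_min = base_lr * plasticity_low    # 1e-4
effective_lr_max = base_lr * plasticity_high   # 2e-3
print(effective_lr_min, effective_lr_max)

# The plasticity lr then only controls how fast a weight's multiplier moves
# inside that [0.1, 2] band, not how far the effective lr can ever get.
```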

Run set: 33 runs