Turns out I was modifying the initialization instead of keeping the default. Training seems much more stable after commenting out that line of code.
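For context, here is a minimal sketch of the kind of mistake this was. The actual custom init isn't shown in this log, so the unit-variance Gaussian below is a hypothetical stand-in; the point is just that overriding PyTorch's default Linear init (Kaiming-uniform) can blow up the weight scale for a wide layer.

```python
import torch

torch.manual_seed(0)
layer = torch.nn.Linear(64, 64)

# PyTorch's default nn.Linear init is Kaiming-uniform, scaled by fan-in,
# so weights for a 64-wide layer are small (std ~ 0.07)
default_std = layer.weight.std().item()

# a hypothetical custom re-initialization (stand-in for the line I commented out):
with torch.no_grad():
    layer.weight.normal_(0.0, 1.0)  # unit-variance Gaussian: far too large here
custom_std = layer.weight.std().item()

print(default_std, custom_std)  # custom weights are an order of magnitude larger
```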
3 layers:
[Line chart: avg_loss, plotted for the selected run set]
Seems like this line got me in trouble:
hidden = (1.0 / hidden.shape[1]) * torch.tanh(hidden)  # apply tanh to keep hidden from blowing up after many recurrences, and to control the magnitude of the recurrent connection
I initially put it in to reduce volatility from the stronger recurrent connection. I had only put it in the new algorithm, though, not the backprop one. Check out how changing it to this helped:
hidden = torch.tanh(hidden)  # apply tanh to keep hidden from blowing up after many recurrences
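The damping becomes obvious if you iterate both updates on a random hidden state: the extra 1/hidden_size factor shrinks the state geometrically toward zero, wiping out the recurrent signal, while tanh alone keeps it order-one. A minimal sketch (the hidden size of 128 is an assumption, not the model's actual width, and a real step would also add input and weight terms):

```python
import torch

torch.manual_seed(0)
hidden_size = 128  # assumed width; the real model's size isn't stated in this log

h_scaled = torch.randn(1, hidden_size)
h_plain = h_scaled.clone()

for _ in range(10):  # ten recurrent applications
    # original (problematic) update: the 1/hidden_size factor compounds each step
    h_scaled = (1.0 / h_scaled.shape[1]) * torch.tanh(h_scaled)
    # fixed update: tanh alone already bounds the state in (-1, 1)
    h_plain = torch.tanh(h_plain)

print(h_scaled.abs().max().item())  # collapses toward zero
print(h_plain.abs().max().item())   # stays order-one
```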
Another big sweep, just to see.
Test out candecay values; they still make no difference.
Tinker with lr and plr, changing the code to allow for more immediate changes, ignoring plasticity this time.
Does the batch size screw things up? Perhaps it gets everything stuck in local minima. Maybe a small batch will let us break out, encouraging exploration.
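One way to probe this is a quick batch-size sweep: smaller batches give noisier gradient estimates, which can act like exploration. The sketch below uses a toy linear-regression stand-in, not the actual model or data from this log.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
# toy dataset; the real task isn't part of this log
X = torch.randn(256, 8)
y = torch.randn(256, 1)
dataset = TensorDataset(X, y)

for batch_size in (4, 16, 64):  # small batches -> noisier updates -> more exploration
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    model = torch.nn.Linear(8, 1)
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    for xb, yb in loader:  # one epoch per setting
        opt.zero_grad()
        loss = torch.nn.functional.mse_loss(model(xb), yb)
        loss.backward()
        opt.step()
    print(batch_size, loss.item())
```

Comparing final losses (or logged avg_loss curves) across these settings would show whether small batches actually break out of the plateau.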