These all use the new long_range_memory_dataset and my 'candidate' learning rule.
Is one layer or two better?
Should it have a high or low hidden size?
Lower seems better for these fast iterations.
What's the ideal learning rate here?
It really seems like I can keep going smaller.
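A quick way to check that hunch is a small learning-rate sweep and a comparison of final losses. This is just a minimal sketch: `train` below is a stand-in toy objective (a 1-D quadratic), not the real model or the candidate rule, and the rate grid is made up.

```python
# Hypothetical stand-in for a training run: plain gradient descent
# on loss = w**2, returning the final loss (an avg_loss analogue).
def train(lr: float, steps: int = 200) -> float:
    w = 5.0                      # start far from the optimum at 0
    for _ in range(steps):
        grad = 2.0 * w           # gradient of w**2
        w -= lr * grad
    return w * w

# Sweep a few rates, going smaller each step, and pick the best.
results = {lr: train(lr) for lr in (0.3, 0.1, 0.03, 0.01)}
best_lr = min(results, key=results.get)
```

On a real run the sweep loop stays the same; only `train` changes to launch the actual experiment and log `avg_loss`.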
Does my learning rule gain an advantage in deeper networks?
Actually compare against backprop this time.
Does backprop really just need higher learning rates?
Try different decay rates
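One way to structure that: fix the initial rate and sweep the decay factor. Again a hedged sketch on the same toy quadratic, assuming an exponential learning-rate decay schedule (one of several plausible choices; the real runs may decay something else, e.g. weight decay).

```python
# Toy comparison of exponential lr-decay factors on loss = w**2.
def train_with_decay(lr0: float, decay: float, steps: int = 200) -> float:
    w = 5.0
    for t in range(steps):
        lr = lr0 * (decay ** t)  # exponential lr decay per step
        w -= lr * 2.0 * w        # gradient of w**2
    return w * w

# decay=1.0 means no decay; smaller factors shut learning off sooner.
losses = {d: train_with_decay(0.1, d) for d in (1.0, 0.99, 0.95, 0.9)}
```

The dict makes the trade-off visible at a glance: aggressive decay freezes the toy problem before it converges, which is the failure mode worth watching for in the real sweep.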
Backprop v. Wackprop: Ultimate Matchup
Heroes