look at norms
Rough hypotheses:
- We'll see norms rise after the accuracy starts to fall in some models, possibly indicating exploding gradients. Yep! Validated. When accuracy begins to fall, the update norms increase as well. Interestingly, the weight norms themselves plateau. Perhaps that makes sense: nothing is being learned!
- Good models will have big differences between the norms of high- and low-plasticity weights. Not validated! A model that's doing better seems to have a shrinking ratio of norms of high- to low-plasticity weights, as well as of the norms of the gradients. A growing weight-norm ratio, with a stagnant ratio for the gradients/updates, seems to indicate a loss climb. It's good for these ratios to shrink, and the trend in the ratio appears well before the result. The gap, rather than the ratio, actually does continue to grow in a good model, but only for the weight norms. For the gap between high- and low-plasticity gradient norms, shrinking is ideal, and stagnation is the consequence of gradient explosion.
- Bottom line: a model with a climbing ratio of high- to low-plasticity weight norms will eventually explode; in a healthy model that ratio ought to shrink. That's as of 40M iterations. Let's see how the idea translates to the longer run, and to other runs. A sketch of how these norms could be logged is below.
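As a minimal sketch (not the code behind these runs), here is one way to log the per-group norms, ratios, and gaps discussed above with PyTorch and wandb. The split into `high_params` / `low_params` is a hypothetical placeholder for however the high- and low-plasticity weights are actually grouped in the runs.

```python
import torch
import wandb


def group_norm(params, attr="data"):
    """L2 norm over a group of tensors (weights or grads)."""
    tensors = [getattr(p, attr) for p in params if getattr(p, attr) is not None]
    if not tensors:
        return torch.tensor(0.0)
    return torch.sqrt(sum(t.pow(2).sum() for t in tensors))


def log_norm_diagnostics(high_params, low_params, step):
    """Log weight/grad norms per plasticity group, plus their ratio and gap."""
    w_high = group_norm(high_params, "data")
    w_low = group_norm(low_params, "data")
    g_high = group_norm(high_params, "grad")
    g_low = group_norm(low_params, "grad")

    eps = 1e-12  # avoid division by zero early in training
    wandb.log(
        {
            "norms/weight_high": w_high.item(),
            "norms/weight_low": w_low.item(),
            "norms/grad_high": g_high.item(),
            "norms/grad_low": g_low.item(),
            # A climbing weight-norm ratio is the proposed warning sign;
            # a shrinking one is the healthy case.
            "norms/weight_ratio_high_to_low": (w_high / (w_low + eps)).item(),
            "norms/grad_ratio_high_to_low": (g_high / (g_low + eps)).item(),
            # Gaps (differences) tracked alongside the ratios, since the
            # weight-norm gap is expected to keep growing even in good runs.
            "norms/weight_gap_high_minus_low": (w_high - w_low).item(),
            "norms/grad_gap_high_minus_low": (g_high - g_low).item(),
        },
        step=step,
    )
```

Calling `log_norm_diagnostics` once per logging interval (after `backward()` but before `optimizer.step()`) would give the same high/low ratio and gap curves the hypotheses refer to.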
Run set (47) · Run set 2 (51)