Grad norm clipping
Created on March 23 | Last edited on March 23
Losses and L0 are almost identical for layer 6.
Grad-norm clipping kills a few more features for some reason.
Overall it's basically a wash. Either choice is fine for TinyStories: we can keep grad-norm clipping in the current runs, and if we do a final set of runs we could remove it then.
GPT-2 can produce much larger grad norms, so we may want to keep clipping there, but perhaps at a much larger threshold.
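For context, a minimal sketch of how grad-norm clipping slots into a PyTorch training step. The model, loss, and max_grad_norm value below are placeholders for illustration, not the configuration used in these runs:

```python
import torch

model = torch.nn.Linear(512, 4096)  # stand-in for the actual model
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
max_grad_norm = 1.0  # hypothetical threshold; GPT-2 may warrant a much larger one

def train_step(batch: torch.Tensor) -> float:
    optimizer.zero_grad()
    loss = model(batch).pow(2).mean()  # placeholder loss
    loss.backward()
    # Rescales all gradients in place so their total L2 norm is <= max_grad_norm;
    # removing this call is the "no clipping" condition compared above.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    optimizer.step()
    return loss.item()
```

Raising the threshold (rather than removing the call entirely) keeps protection against rare gradient spikes while leaving typical updates untouched.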
Section 1
[Panel grid: Run set, 8 runs]