
Grad norm clipping

Losses and L0 are almost identical for layer 6. Grad norm clipping kills a few more features for some reason, but it's basically a wash overall. I'm fine doing whatever for TinyStories: we can keep grad norm clipping in for the current runs, and if we do a final set of runs we can remove it. GPT-2 can reach much larger grad norms, so we may want to keep clipping there, but perhaps at a much larger threshold.
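For reference, a minimal sketch of where clipping would sit in a training step, assuming a standard PyTorch loop. The `max_norm` value, `train_step` signature, and `loss_fn` are placeholders for illustration, not the settings used in these runs:

```python
import torch

# Assumed threshold for illustration only; GPT-2 would likely want a
# much larger value, per the note above.
MAX_NORM = 1.0

def train_step(model, optimizer, loss_fn, batch):
    optimizer.zero_grad()
    loss = loss_fn(model, batch)
    loss.backward()
    # Rescale gradients so their global L2 norm is at most MAX_NORM.
    # Returns the pre-clip norm, which is worth logging when choosing
    # a threshold for a new model.
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), MAX_NORM)
    optimizer.step()
    return loss.item(), grad_norm.item()
```

Logging the pre-clip norm that `clip_grad_norm_` returns is one way to decide whether a larger threshold is warranted before a final set of runs.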


Section 1


[Chart panels: run metrics vs. Step (100k–400k), each over the same 8-run set; interactive plots not reproduced in this export.]