Grad norm clipping
Created on March 23 | Last edited on March 23
Losses and L0 are almost identical for layer 6.
Grad-norm clipping kills a few more features for some reason.
Overall it's basically a wash. Either choice is fine for TinyStories: we can keep grad-norm clipping in the current runs, and if we do a final set of runs we could remove it then.
GPT-2 can produce much larger grad norms, so we may want to keep clipping there, but perhaps at a much larger threshold.
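For context, a minimal sketch of how grad-norm clipping slots into a PyTorch training step. The model, loss, and max_grad_norm value below are placeholders for illustration, not the configuration used in these runs:

```python
import torch

model = torch.nn.Linear(512, 4096)  # stand-in for the actual model
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
max_grad_norm = 1.0  # hypothetical threshold; GPT-2 may warrant a much larger one

def train_step(batch: torch.Tensor) -> float:
    optimizer.zero_grad()
    loss = model(batch).pow(2).mean()  # placeholder loss
    loss.backward()
    # Rescales all gradients in place so their total L2 norm is <= max_grad_norm;
    # removing this call is the "no clipping" condition compared above.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    optimizer.step()
    return loss.item()
```

Raising the threshold (rather than removing the call entirely) keeps protection against rare gradient spikes while leaving typical updates untouched.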
Section 1
[Panel grid: Run set, 8 runs]