[Tuning] Pre-norm, initialization

To scale well we should follow established best practices for normalization placement and initialization.
Created on February 12 | Last edited on February 13

Post-norm seems better after extended compute

  • This holds without any initialization changes.
      • We should revisit this once initialization changes are in place.
  • True for both RTT and Pitt.
      • TODO: identify the optimal LR and LR schedule.
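For reference, the pre-norm vs. post-norm distinction discussed above is just where LayerNorm sits relative to the residual add. A minimal numpy sketch (the `sublayer` stand-in is hypothetical; in a real model it would be attention or an MLP):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize over the feature (last) dimension.
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def sublayer(x):
    # Hypothetical stand-in for the residual branch (attention or MLP).
    return 0.5 * x

def post_norm_block(x):
    # Post-norm (original Transformer): normalize AFTER the residual add.
    return layer_norm(x + sublayer(x))

def pre_norm_block(x):
    # Pre-norm: normalize the sublayer input; the residual path stays unnormalized.
    return x + sublayer(layer_norm(x))
```

Pre-norm keeps an identity residual path through the whole stack, which is why it is usually easier to train without initialization tricks; the note above suggests post-norm can still win given enough compute.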



[Charts: training curves vs. epoch (log-scale x-axis) for runs tagged `rtt_m75-sweep-lr` and `rtt_m75_pre-sweep-lr`; run set 10854]



LR tuning for RTT: the optimum is around 4-6e-4.


[Chart: RTT LR sweep; run set 10854]


For Pitt, a lower LR is more important (optimum around 2-3e-4).

[Chart: Pitt LR sweep; run set 10854]

Note: the optimal LR appears to be consistent whether pre-norm or post-norm is used.
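The remaining TODO is the LR schedule. A minimal sketch of a warmup + cosine-decay schedule over the sweep grids implied by the notes (the grid values and warmup/step counts here are illustrative assumptions, not the actual sweep config):

```python
import math

def cosine_lr(step, total_steps, peak_lr, warmup_steps=0):
    # Linear warmup to peak_lr, then cosine decay to zero.
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Hypothetical sweep grids matching the optima noted above.
sweep = {
    "rtt":  [4e-4, 5e-4, 6e-4],  # RTT optimum ~4-6e-4
    "pitt": [2e-4, 3e-4],        # Pitt prefers a lower LR, ~2-3e-4
}
for model, peaks in sweep.items():
    for peak in peaks:
        lr_mid = cosine_lr(500, 1000, peak, warmup_steps=100)
        print(model, peak, lr_mid)
```

If the LR optimum really is insensitive to norm placement, the same grid can be reused for the pre-norm and post-norm sweeps, halving the search cost.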