[Tuning] Pre-norm, initialization
To scale well, we should follow good normalization and initialization practices.
Created on February 12|Last edited on February 13
Post-norm seems better after extended compute.
- This is without any initialization changes.
- We should revisit this once initialization changes are in place.
- This holds for both RTT and Pitt.
- Next step: identify the optimal LR and LR schedule.
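For reference, the pre-norm vs post-norm distinction above is just where the normalization sits relative to the residual connection. A minimal NumPy sketch (the `sublayer` stand-in and function names are illustrative, not our actual blocks):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize over the feature (last) dimension.
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def pre_norm_block(x, sublayer):
    # Pre-norm: normalize the input, apply the sublayer, then add the residual.
    return x + sublayer(layer_norm(x))

def post_norm_block(x, sublayer):
    # Post-norm: apply the sublayer, add the residual, then normalize the sum.
    return layer_norm(x + sublayer(x))
```

Post-norm normalizes the residual stream itself, which is why it tends to interact more strongly with initialization than pre-norm does.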
[Charts: Run set 10854]
LR tuning for RTT: the optimum is around 4-6e-4.
For Pitt, a lower LR is more important: around 2-3e-4.
Note: the LR optimum appears consistent whether we use pre-norm or post-norm.
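The notes record the peak LRs but not the schedule shape. As a sketch under that assumption, a common choice is linear warmup followed by cosine decay, with the peak set from the ranges above (e.g. 4e-4 for RTT, 2e-4 for Pitt); `lr_at_step` and its parameter names are illustrative, not our training code:

```python
import math

def lr_at_step(step, peak_lr, warmup_steps, total_steps, min_lr=0.0):
    # Linear warmup from ~0 to peak_lr over warmup_steps,
    # then cosine decay from peak_lr down to min_lr.
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

For example, `lr_at_step(step, peak_lr=4e-4, warmup_steps=100, total_steps=1000)` reaches 4e-4 at the end of warmup and decays back toward zero by the final step.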