[Tuning] Pre-norm, initialization

To scale well we should follow established best practices for normalization placement and initialization.
Created on February 12 | Last edited on February 13

Post-norm seems better after extended compute

  • This holds without any initialization changes.
      • We should revisit this once initialization changes are in place.
  • True for both RTT and Pitt.
      • TODO: identify the optimal LR and LR schedule.
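For reference, the pre-norm vs. post-norm distinction discussed above is just where LayerNorm sits relative to the residual add. A minimal numpy sketch (the `sublayer` stand-in is hypothetical; in a real model it would be attention or an MLP):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize over the feature (last) dimension.
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def sublayer(x):
    # Hypothetical stand-in for the residual branch (attention or MLP).
    return 0.5 * x

def post_norm_block(x):
    # Post-norm (original Transformer): normalize AFTER the residual add.
    return layer_norm(x + sublayer(x))

def pre_norm_block(x):
    # Pre-norm: normalize the sublayer input; the residual path stays unnormalized.
    return x + sublayer(layer_norm(x))
```

Pre-norm keeps an identity residual path through the whole stack, which is why it is usually easier to train without initialization tricks; the note above suggests post-norm can still win given enough compute.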



[Charts: training curves vs. epoch (log-scale x-axis) for runs tagged `rtt_m75-sweep-lr` and `rtt_m75_pre-sweep-lr`; run set 10854]



LR tuning for RTT: the optimum is around 4-6e-4.


[Chart: RTT LR sweep; run set 10854]


For Pitt, a lower LR is more important (optimum around 2-3e-4).

[Chart: Pitt LR sweep; run set 10854]

Note: the optimal LR appears to be consistent whether pre-norm or post-norm is used.
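The remaining TODO is the LR schedule. A minimal sketch of a warmup + cosine-decay schedule over the sweep grids implied by the notes (the grid values and warmup/step counts here are illustrative assumptions, not the actual sweep config):

```python
import math

def cosine_lr(step, total_steps, peak_lr, warmup_steps=0):
    # Linear warmup to peak_lr, then cosine decay to zero.
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Hypothetical sweep grids matching the optima noted above.
sweep = {
    "rtt":  [4e-4, 5e-4, 6e-4],  # RTT optimum ~4-6e-4
    "pitt": [2e-4, 3e-4],        # Pitt prefers a lower LR, ~2-3e-4
}
for model, peaks in sweep.items():
    for peak in peaks:
        lr_mid = cosine_lr(500, 1000, peak, warmup_steps=100)
        print(model, peak, lr_mid)
```

If the LR optimum really is insensitive to norm placement, the same grid can be reused for the pre-norm and post-norm sweeps, halving the search cost.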