DCLM Replication
* v1 dclm_7b0820: Llama 2 architecture, 7B, with DCLM's optimizer hyperparameters. Very spiky, so we reduced the LR in v2.
* v2 dclm_7b0820-2: reduced LR.
* v3 dclm_7b0821: kept the reduced LR, added a shuffle buffer (100k), and reduced beta2 to 0.95 to match the DCLM paper (see the sketch after this list).
* v4 dclm_7b0821-3: v3 but with the old beta2 (0.999).
* v5 dclm_7b0822-1: v3 but with the DCLM LR (so beta2 stays at 0.95).
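For concreteness, here is a minimal sketch of the two knobs these runs vary: the AdamW betas and a streaming shuffle buffer. PyTorch, the LR value, and the weight decay are assumptions for illustration; only beta2 (0.95 vs 0.999) and the 100k buffer size come from the notes above.

```python
import random
import torch

# Stand-in module; the real runs train a 7B Llama 2-style model.
model = torch.nn.Linear(4096, 4096)

# beta2=0.95 matches the DCLM paper (v3/v5); v1/v4 use 0.999.
# lr and weight_decay are placeholders, not the runs' actual values.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=2e-4,            # placeholder; v2-v4 use a reduced LR, v5 the original DCLM LR
    betas=(0.9, 0.95),
    weight_decay=0.1,   # placeholder
)

def shuffle_buffer(stream, buffer_size=100_000, seed=0):
    """Streaming shuffle as added in v3: hold up to `buffer_size` examples
    and emit a uniformly random one each time the buffer fills."""
    rng = random.Random(seed)
    buf = []
    for example in stream:
        buf.append(example)
        if len(buf) >= buffer_size:
            # Swap a random element to the end and pop it (O(1) removal).
            i = rng.randrange(len(buf))
            buf[i], buf[-1] = buf[-1], buf[i]
            yield buf.pop()
    rng.shuffle(buf)  # drain whatever remains at end of stream
    yield from buf
```

A 100k buffer only gives a local shuffle, which is usually enough to break up long runs of correlated documents within a shard.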
Other runs
Conclusions:
* beta2=0.95 is important! (see the note after this list)
* A higher LR might not matter?
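One standard way to rationalize the beta2 result (an interpretation, not something these runs measured directly): Adam's second-moment estimate is an EMA with an effective averaging horizon of roughly 1/(1 - beta2), so beta2=0.999 adapts to shifts in gradient scale about 50x more slowly than 0.95, and that stale denominator is often blamed for loss spikes in large-scale training.

```python
# Effective averaging horizon of Adam's second-moment EMA: ~1 / (1 - beta2).
for beta2 in (0.999, 0.95):
    print(f"beta2={beta2}: ~{1 / (1 - beta2):.0f} steps")
# beta2=0.999: ~1000 steps (slow to adapt when gradient scale shifts)
# beta2=0.95:  ~20 steps   (tracks the current gradient scale closely)
```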