DCLM Replication
* v1 dclm_7b0820: Llama 2 architecture, 7B, with DCLM's optimizer hyperparameters. Very spiky, so we reduced the LR in v2.
* v2 dclm_7b0820-2: reduced LR.
* v3 dclm_7b0821: kept the reduced LR, added a shuffle buffer (100k), and reduced beta2 to 0.95 to match the DCLM paper (see the sketch after this list).
* v4 dclm_7b0821-3: v3 but with the old beta2 (0.999).
* v5 dclm_7b0822-1: v3 but with the DCLM LR (so beta2 stays at 0.95).
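For concreteness, here is a minimal sketch of the two knobs these runs vary: the AdamW betas and a streaming shuffle buffer. PyTorch, the LR value, and the weight decay are assumptions for illustration; only beta2 (0.95 vs 0.999) and the 100k buffer size come from the notes above.

```python
import random
import torch

# Stand-in module; the real runs train a 7B Llama 2-style model.
model = torch.nn.Linear(4096, 4096)

# beta2=0.95 matches the DCLM paper (v3/v5); v1/v4 use 0.999.
# lr and weight_decay are placeholders, not the runs' actual values.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=2e-4,            # placeholder; v2-v4 use a reduced LR, v5 the original DCLM LR
    betas=(0.9, 0.95),
    weight_decay=0.1,   # placeholder
)

def shuffle_buffer(stream, buffer_size=100_000, seed=0):
    """Streaming shuffle as added in v3: hold up to `buffer_size` examples
    and emit a uniformly random one each time the buffer fills."""
    rng = random.Random(seed)
    buf = []
    for example in stream:
        buf.append(example)
        if len(buf) >= buffer_size:
            # Swap a random element to the end and pop it (O(1) removal).
            i = rng.randrange(len(buf))
            buf[i], buf[-1] = buf[-1], buf[i]
            yield buf.pop()
    rng.shuffle(buf)  # drain whatever remains at end of stream
    yield from buf
```

A 100k buffer only gives a local shuffle, which is usually enough to break up long runs of correlated documents within a shard.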
Other runs
Conclusions:
* beta2=0.95 is important! (see the note after this list)
* A higher LR might not matter?
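One standard way to rationalize the beta2 result (an interpretation, not something these runs measured directly): Adam's second-moment estimate is an EMA with an effective averaging horizon of roughly 1/(1 - beta2), so beta2=0.999 adapts to shifts in gradient scale about 50x more slowly than 0.95, and that stale denominator is often blamed for loss spikes in large-scale training.

```python
# Effective averaging horizon of Adam's second-moment EMA: ~1 / (1 - beta2).
for beta2 in (0.999, 0.95):
    print(f"beta2={beta2}: ~{1 / (1 - beta2):.0f} steps")
# beta2=0.999: ~1000 steps (slow to adapt when gradient scale shifts)
# beta2=0.95:  ~20 steps   (tracks the current gradient scale closely)
```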