Reports
Experiment 950: How Does the Learning Rate Schedule in Pretraining Impact SFT? (2025-05-18)
An exploration of whether models with the same or similar loss can be harder to SFT.
Fantastic Optimizers and Where to Find Them (2025-05-19)
We present a rigorous study of 10 optimizers for language model training speed and reveal that the realistic speedup is significantly lower than claimed. We analyze why this is the case and report new observations about optimization.
Marin-8B-Base (2025-03-06)
See https://github.com/stanford-crfm/marin/issues/600 for narrative
Tootsie 8B phoenix cooldown ("starling") (2025-04-24)
See https://github.com/stanford-crfm/marin/issues/600 for narrative
Tootsie 8B rewarm v1 ("adept-phoenix") (2025-03-14)
See https://github.com/stanford-crfm/marin/issues/600 for narrative
Tootsie 8B Main Phase ("original-ocelot") (2025-03-12)
See https://github.com/stanford-crfm/marin/issues/600 for narrative
Tootsie 8B cooldown v2 ("groovy-parrot") (2025-03-12)
See https://github.com/stanford-crfm/marin/issues/600 for narrative
Tootsie 8B dessert v2 ("fiery-hippo") (2025-03-12)
See https://github.com/stanford-crfm/marin/issues/600 for narrative
Tootsie 8B dessert v1 ("zircon-badger") (2025-03-12)
See https://github.com/stanford-crfm/marin/issues/600 for narrative
Tootsie 8B cooldown v1 ("monumental-jellyfish") (2025-03-12)
See https://github.com/stanford-crfm/marin/issues/600 for narrative
654 Scaling Law (2024-12-11)
https://github.com/stanford-crfm/marin/issues/654
Tokenizer Comparison (2024-11-19)
FineWeb-Edu 1.4b tokenizer comparison:
* llama2 (~32k vocab)
* llama3 (~128k vocab)
* neox (~50k vocab)
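As a rough way to reproduce the vocab-size numbers above, the sketch below loads the three tokenizer families from the Hugging Face Hub. The repo names are assumptions (the report does not say which checkpoints were used), and the Llama repos are gated.

```python
from transformers import AutoTokenizer

# Assumed Hub checkpoints for the three tokenizer families; the Llama
# repos are gated and require accepting their licenses before download.
TOKENIZERS = {
    "llama2": "meta-llama/Llama-2-7b-hf",    # ~32k vocab
    "llama3": "meta-llama/Meta-Llama-3-8B",  # ~128k vocab
    "neox": "EleutherAI/gpt-neox-20b",       # ~50k vocab
}

sample = "FineWeb-Edu is a heavily filtered subset of FineWeb."
for name, repo in TOKENIZERS.items():
    tok = AutoTokenizer.from_pretrained(repo)
    n_tokens = len(tok(sample)["input_ids"])
    print(f"{name:7s} vocab_size={tok.vocab_size:>7d} tokens_for_sample={n_tokens}")
```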
474: Config Sweeps (2024-10-31)
## 150m
Fast run winner: bf4172 or f2c1ce (2-ish hours)
Best run: fcd416
## 330m
Fast run winner: 1631b7 (2.5 hours)
Best run: e3faf7
DCLM Replication (2024-08-21)
* v1 dclm_7b0820: Llama 2 architecture at 7B with DCLM's optimizer hyperparameters. Very spiky, so we reduced the LR in v2.
* v2 dclm_7b0820-2: reduced LR.
* v3 dclm_7b0821: kept the reduced LR, added a shuffle buffer (100k), and reduced beta2 to match the DCLM paper.
* v4 dclm_7b0821-3: v3 but with the old beta2 (0.999).
* v5 dclm_7b0822-1: v3 but with the DCLM LR (keeping beta2=0.95).
Other runs:
Conclusions:
* beta2=0.95 is important! (See the optimizer sketch below.)
* A higher LR might not matter?
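For concreteness, here is a minimal optax sketch of the beta2 contrast between v4 and v5. Only the two beta2 values come from the notes above; the learning rate, beta1, and weight decay are placeholders, not the actual run configs.

```python
import optax

def make_adamw(learning_rate: float, beta2: float) -> optax.GradientTransformation:
    """AdamW configurations differing only in beta2; everything else is a placeholder."""
    return optax.adamw(
        learning_rate=learning_rate,
        b1=0.9,            # placeholder, not taken from the runs above
        b2=beta2,          # the knob under study
        weight_decay=0.1,  # placeholder, not taken from the runs above
    )

# Placeholder LR; the actual "reduced" vs. "DCLM" LRs live in the run configs.
v4_optimizer = make_adamw(learning_rate=1e-3, beta2=0.999)  # old beta2
v5_optimizer = make_adamw(learning_rate=1e-3, beta2=0.95)   # beta2 matching the DCLM paper
```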
Olmo 7B Replication (2024-07-15)
Our attempt at replicating OLMo 7B. We used the OLMo tokenizer and Dolma 1.7. Our architecture was almost identical, except that we use RMSNorm rather than LayerNorm (neither version learns a bias term or a gain term on the norm).
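To make the architectural difference concrete, here is a minimal JAX sketch of the two norms as described above (no learned gain or bias); it is an illustration, not the actual OLMo or Levanter code.

```python
import jax.numpy as jnp

def layer_norm(x: jnp.ndarray, eps: float = 1e-5) -> jnp.ndarray:
    # Non-parametric LayerNorm: center and rescale, with no learned gain or bias.
    mean = jnp.mean(x, axis=-1, keepdims=True)
    var = jnp.var(x, axis=-1, keepdims=True)
    return (x - mean) / jnp.sqrt(var + eps)

def rms_norm(x: jnp.ndarray, eps: float = 1e-5) -> jnp.ndarray:
    # Non-parametric RMSNorm: rescale by the root-mean-square only,
    # skipping the mean-centering step; again no learned gain or bias.
    rms = jnp.sqrt(jnp.mean(jnp.square(x), axis=-1, keepdims=True) + eps)
    return x / rms
```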
High-Quality Data for Many Epochs vs. Lower-Quality Data for Fewer Epochs (2024-12-11)
Data browser link: https://marlin-subtle-barnacle.ngrok-free.app/experiment?path=gs%3A//marin-us-central2/experiments/exp636_stackexchange_vs_hqwebpages-a374bc.json
Olmo 7b SFT Run (2024-11-15)
Reproduced Olmo 7b SFT with the following hyperparameters: Batch