Reports
Experiment 950: How Does the Learning Rate Schedule in Pretraining Impact SFT? (2025-05-18)
An exploration of whether models with the same or similar loss can be harder to SFT.
Fantastic Optimizers and Where to Find Them (2025-05-19)
We present a rigorous study of 10 optimizers for language model training speed and reveal that the realistic speedup is significantly lower than claimed. We analyze why this is the case and report new observations about optimization.
Marin-8B-Base (2025-03-06)
See https://github.com/stanford-crfm/marin/issues/600 for narrative
Tootsie 8B phoenix cooldown ("starling") (2025-04-24)
See https://github.com/stanford-crfm/marin/issues/600 for narrative
Tootsie 8B rewarm v1 ("adept-phoenix") (2025-03-14)
See https://github.com/stanford-crfm/marin/issues/600 for narrative
Tootsie 8B Main Phase ("original-ocelot") (2025-03-12)
See https://github.com/stanford-crfm/marin/issues/600 for narrative
Tootsie 8B cooldown v2 ("groovy-parrot") (2025-03-12)
See https://github.com/stanford-crfm/marin/issues/600 for narrative
Tootsie 8B dessert v2 ("fiery-hippo") (2025-03-12)
See https://github.com/stanford-crfm/marin/issues/600 for narrative
Tootsie 8B dessert v1 ("zircon-badger") (2025-03-12)
See https://github.com/stanford-crfm/marin/issues/600 for narrative
Tootsie 8B cooldown v1 ("monumental-jellyfish") (2025-03-12)
See https://github.com/stanford-crfm/marin/issues/600 for narrative
654 Scaling Law (2024-12-11)
https://github.com/stanford-crfm/marin/issues/654
Tokenizer Comparison (2024-11-19)
FineWeb-Edu 1.4b tokenizer comparison:
* llama2 (~32k vocab)
* llama3 (~128k vocab)
* neox (~50k vocab)
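As a rough way to reproduce the vocab-size numbers above, the sketch below loads the three tokenizer families from the Hugging Face Hub. The repo names are assumptions (the report does not say which checkpoints were used), and the Llama repos are gated.

```python
from transformers import AutoTokenizer

# Assumed Hub checkpoints for the three tokenizer families; the Llama
# repos are gated and require accepting their licenses before download.
TOKENIZERS = {
    "llama2": "meta-llama/Llama-2-7b-hf",    # ~32k vocab
    "llama3": "meta-llama/Meta-Llama-3-8B",  # ~128k vocab
    "neox": "EleutherAI/gpt-neox-20b",       # ~50k vocab
}

sample = "FineWeb-Edu is a heavily filtered subset of FineWeb."
for name, repo in TOKENIZERS.items():
    tok = AutoTokenizer.from_pretrained(repo)
    n_tokens = len(tok(sample)["input_ids"])
    print(f"{name:7s} vocab_size={tok.vocab_size:>7d} tokens_for_sample={n_tokens}")
```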
474: Config Sweeps (2024-10-31)
## 150m
Fast run winner: bf4172 or f2c1ce (2-ish hours)
Best run: fcd416
## 330m
Fast run winner: 1631b7 (2.5 hours)
Best run: e3faf7
DCLM Replication (2024-08-21)
* v1 dclm_7b0820: Llama 2 architecture at 7B with DCLM's optimizer hyperparameters. Very spiky, so we reduced the LR in v2.
* v2 dclm_7b0820-2: reduced LR.
* v3 dclm_7b0821: kept the reduced LR, added a shuffle buffer (100k), and reduced beta2 to match the DCLM paper.
* v4 dclm_7b0821-3: v3 but with the old beta2 (0.999).
* v5 dclm_7b0822-1: v3 but with the DCLM LR (keeping beta2=0.95).
Other runs:
Conclusions:
* beta2=0.95 is important! (See the optimizer sketch below.)
* A higher LR might not matter?
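For concreteness, here is a minimal optax sketch of the beta2 contrast between v4 and v5. Only the two beta2 values come from the notes above; the learning rate, beta1, and weight decay are placeholders, not the actual run configs.

```python
import optax

def make_adamw(learning_rate: float, beta2: float) -> optax.GradientTransformation:
    """AdamW configurations differing only in beta2; everything else is a placeholder."""
    return optax.adamw(
        learning_rate=learning_rate,
        b1=0.9,            # placeholder, not taken from the runs above
        b2=beta2,          # the knob under study
        weight_decay=0.1,  # placeholder, not taken from the runs above
    )

# Placeholder LR; the actual "reduced" vs. "DCLM" LRs live in the run configs.
v4_optimizer = make_adamw(learning_rate=1e-3, beta2=0.999)  # old beta2
v5_optimizer = make_adamw(learning_rate=1e-3, beta2=0.95)   # beta2 matching the DCLM paper
```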
Olmo 7B Replication (2024-07-15)
Our attempt at replicating OLMo 7B. We used the OLMo tokenizer and Dolma 1.7. Our architecture was almost identical, except that we use RMSNorm rather than LayerNorm (neither version learns a bias term or a gain term on the norm).
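To make the architectural difference concrete, here is a minimal JAX sketch of the two norms as described above (no learned gain or bias); it is an illustration, not the actual OLMo or Levanter code.

```python
import jax.numpy as jnp

def layer_norm(x: jnp.ndarray, eps: float = 1e-5) -> jnp.ndarray:
    # Non-parametric LayerNorm: center and rescale, with no learned gain or bias.
    mean = jnp.mean(x, axis=-1, keepdims=True)
    var = jnp.var(x, axis=-1, keepdims=True)
    return (x - mean) / jnp.sqrt(var + eps)

def rms_norm(x: jnp.ndarray, eps: float = 1e-5) -> jnp.ndarray:
    # Non-parametric RMSNorm: rescale by the root-mean-square only,
    # skipping the mean-centering step; again no learned gain or bias.
    rms = jnp.sqrt(jnp.mean(jnp.square(x), axis=-1, keepdims=True) + eps)
    return x / rms
```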
High-Quality Data for Many Epochs vs. Lower-Quality Data for Fewer Epochs (2024-12-11)
Data browser link: https://marlin-subtle-barnacle.ngrok-free.app/experiment?path=gs%3A//marin-us-central2/experiments/exp636_stackexchange_vs_hqwebpages-a374bc.json
Olmo 7b SFT Run (2024-11-15)
Reproduced Olmo 7b SFT with the following hyperparameters: Batch