High Quality Many Epochs vs. Lower Quality Fewer Epochs
Data browser link: https://marlin-subtle-barnacle.ngrok-free.app/experiment?path=gs%3A//marin-us-central2/experiments/exp636_stackexchange_vs_hqwebpages-a374bc.json
Created on December 11|Last edited on December 11
Goal: We want to see which matters more: high-quality data repeated for many epochs, or lower-quality data seen for fewer epochs. In these runs, lower-quality data for fewer epochs is better across all metrics we looked at, such as c4_en/bpb and mmlu/bpb.
Experiment Setting
- Default train: llama 1.4b, 42B tokens
Models trained:
- dolma_stackexchange: Dolma's Stackexchange split, which is ~20B Olmo tokens. This means roughly 42/20 = 2.1 epochs.
- dolmino_stackexchange: Dolmino's Stackexchange split, which is ~1.26B llama-3 tokens. This means roughly 42/1.26 = 33.33 epochs.
- stackexchange-qa-vote-geq-5-rm-duplicate-200k: High-quality filtered web pages selected with a Stack Exchange quality classifier (200k Stack Exchange positives, 200k RefinedWeb negatives). Filtered from fineweb CC-MAIN-2024-18, keeping the top 20% of documents, which amounts to ~67B tokens. Trained on 42B tokens, so less than one full epoch (42/67 ≈ 0.63).
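The epoch counts above are just the training budget divided by each dataset's size. A minimal sketch of that arithmetic follows, using the approximate token counts quoted in the list. The variable and function names are illustrative and not part of the experiment code. The counts also come from different tokenizers (Olmo vs. llama-3), so the ratios are rough estimates.

```python
# Training budget for every run, in billions of tokens (from the experiment setting).
TRAIN_TOKENS_B = 42.0

# Approximate dataset sizes in billions of tokens, as quoted in the list above.
# Note: these were measured with different tokenizers, so treat them as estimates.
DATASET_TOKENS_B = {
    "dolma_stackexchange": 20.0,
    "dolmino_stackexchange": 1.26,
    "stackexchange-qa-vote-geq-5-rm-duplicate-200k": 67.0,
}


def epochs(train_tokens_b: float, dataset_tokens_b: float) -> float:
    """Number of passes over a dataset implied by a fixed token budget."""
    return train_tokens_b / dataset_tokens_b


EPOCHS = {
    name: epochs(TRAIN_TOKENS_B, size) for name, size in DATASET_TOKENS_B.items()
}

for name, n in EPOCHS.items():
    print(f"{name}: {n:.2f} epochs")
```

Under these numbers the filtered web pages run sees less than one full pass over its data, while the dolmino run repeats its data more than thirty times.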
Results:
The Stack Exchange-filtered web pages run performs best by far on c4_en/bpb. dolma_stackexchange is the closer of the other two runs, while dolmino_stackexchange's loss trends upward as the number of epochs grows. I assume this is due to overfitting, since that run repeats its data for so many epochs.
Internal Evals
This set of panels contains runs from a private project, which cannot be shown in this report
Evals
This set of panels contains runs from a private project, which cannot be shown in this report