FineWeb v/s FineWeb-Edu

Created on September 6|Last edited on September 6
To test the hypothesis that high-quality data filtered from an existing source makes a difference, I ran an experiment training two Llama-3.2-1B models on 100B tokens each: one on FineWeb and one on FineWeb-Edu. Once the models appear to have converged, the final loss difference is noticeable (~0.2), which is a big gap when you think in terms of perplexity.
Loss (FineWeb) = 2.7 → perplexity ≈ exp(2.7) ≈ 14.88.
Loss (FineWeb-Edu) = 2.5 → perplexity ≈ exp(2.5) ≈ 12.18.
Absolute loss gap = 0.2 (FineWeb − FineWeb-Edu).
Relative loss gap ≈ 8% (0.2 / 2.5).
Relative perplexity gap ≈ 22% (exp(0.2) ≈ 1.22, i.e. 14.88 vs 12.18).
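The arithmetic above can be checked with a few lines of Python (the loss values are the ones reported here; nothing else is assumed):

```python
import math

loss_fw, loss_edu = 2.7, 2.5  # final training losses reported above

ppl_fw = math.exp(loss_fw)    # perplexity = exp(loss) ≈ 14.88
ppl_edu = math.exp(loss_edu)  # ≈ 12.18

rel_loss_gap = (loss_fw - loss_edu) / loss_edu  # ≈ 8%
rel_ppl_gap = ppl_fw / ppl_edu - 1              # exp(0.2) − 1 ≈ 22%

print(f"ppl FW={ppl_fw:.2f}, FW-Edu={ppl_edu:.2f}, ppl gap={rel_ppl_gap:.1%}")
```

Note that a constant loss gap translates into a constant *multiplicative* perplexity gap, since exp(a + d) = exp(a)·exp(d).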
Note: the same gap shows up in validation loss, but I logged validation only every 2K steps, which is ~8.3 billion tokens.
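As a back-of-envelope check of the "2K steps ≈ 8.3B tokens" figure: the global batch size isn't stated in this report, but a batch of 2048 sequences at seq_length 2048 (the seq_length is in the run config; the batch size is an assumption) lands close to the quoted number:

```python
seq_length = 2048          # from the run config in this report
global_batch_seqs = 2048   # assumed; the batch size is not stated
tokens_per_step = seq_length * global_batch_seqs   # 4,194,304 ≈ 4.2M
tokens_per_2k_steps = tokens_per_step * 2000       # ≈ 8.4B

print(f"{tokens_per_2k_steps / 1e9:.1f}B tokens per 2K steps")
```

This gives ≈8.4B, roughly consistent with the ~8.3B quoted above.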

Run Statistics


[Charts: training loss (y ≈ 4–10) and learning rate schedule (y ≈ 1e-4–4e-4) over ~12K steps for the two runs]

Runs compared:
Llama FineWeb — seq_length: 2048, dataset: FW, model: llama, pos_emb: rotary, rotary_pct: 1, seed: 1234, lr: 0.0004
Llama FineWeb-Edu — seq_length: 2048, dataset: FW_Edu, model: llama, pos_emb: rotary, rotary_pct: 1, seed: 1234, lr: 0.0004