FineWeb vs. FineWeb-Edu
To test the hypothesis that high-quality data filtered from an existing source makes a difference, I ran an experiment training two Llama-3.2-1B models on 100B tokens each, one on FineWeb and one on FineWeb-Edu. Once the models appear to have converged, the final loss difference is noticeable (~0.2), which is a big gap if you think in terms of perplexity:
Loss FineWeb = 2.7 → perplexity ≈ exp(2.7) ≈ 14.88.
Loss FineWeb-Edu = 2.5 → perplexity ≈ exp(2.5) ≈ 12.18.
Absolute loss gap = 0.2 (FineWeb − FineWeb-Edu).
Relative loss gap ≈ 8% (0.2 / 2.5).
Relative perplexity improvement ≈ 22% (12.18 vs. 14.88).
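A quick sanity check of the numbers above, as a minimal sketch (assuming the reported losses are cross-entropy in nats, so perplexity = exp(loss)):

```python
import math

# Final training losses reported above.
loss_fineweb = 2.7
loss_fineweb_edu = 2.5

# Cross-entropy loss in nats maps to perplexity via exp().
ppl_fineweb = math.exp(loss_fineweb)          # ≈ 14.88
ppl_fineweb_edu = math.exp(loss_fineweb_edu)  # ≈ 12.18

# The relative perplexity gap depends only on the loss difference:
# exp(loss_a) / exp(loss_b) - 1 == exp(loss_a - loss_b) - 1.
rel_improvement = math.exp(loss_fineweb - loss_fineweb_edu) - 1  # ≈ 0.22

print(f"FineWeb perplexity:     {ppl_fineweb:.2f}")
print(f"FineWeb-Edu perplexity: {ppl_fineweb_edu:.2f}")
print(f"Relative improvement:   {rel_improvement:.1%}")
```

Note that the ~22% figure follows directly from the 0.2 loss gap (exp(0.2) ≈ 1.22), independent of the absolute loss values.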
Note: the same difference shows up in validation loss, but I only evaluated every 2K steps, which is ~8.3 billion tokens between evaluations.
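For reference, a back-of-envelope check of that evaluation interval (the per-step token count below is inferred from the stated numbers, not taken from the run config):

```python
eval_interval_steps = 2_000
tokens_per_interval = 8.3e9   # ~8.3B tokens between validation runs, as stated above
total_tokens = 100e9          # 100B-token training run

# Implied global batch size in tokens per optimizer step.
tokens_per_step = tokens_per_interval / eval_interval_steps   # ≈ 4.15M tokens/step

# Number of validation points over the whole run.
num_evals = total_tokens / tokens_per_interval                # ≈ 12 evaluations

print(f"~{tokens_per_step / 1e6:.2f}M tokens/step, ~{num_evals:.0f} validation evals total")
```

So the validation curve is fairly coarse (roughly a dozen points over the run), which is why the training loss is the better-resolved view of the gap.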
Run Statistics

[Interactive W&B chart panels: run set of 8 runs]