
Quality Classifier Comparison

Created on November 20 | Last edited on December 3
Comparing quality classifiers under the following setup:
  • data: 42B FineWeb 18/2024, filtered to the top 20% by classifier score
  • model: Llama 1.4B
  • negative samples: 200k RefinedWeb samples from DCLM
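
The top-20% cutoff can be illustrated with a minimal sketch. It assumes each document already has a scalar quality score where higher means better. The function names and the toy scores are hypothetical and are not taken from the runs in this report.

```python
def top_fraction_cutoff(scores, fraction=0.2):
    """Return the score threshold that keeps roughly the top `fraction` of documents."""
    if not scores:
        raise ValueError("scores must be non-empty")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    ranked = sorted(scores, reverse=True)
    # Number of documents to keep; always keep at least one.
    keep = max(1, int(len(ranked) * fraction))
    return ranked[keep - 1]


def filter_top_fraction(docs, scores, fraction=0.2):
    """Keep the documents whose score is at or above the cutoff (original order preserved)."""
    threshold = top_fraction_cutoff(scores, fraction)
    return [doc for doc, score in zip(docs, scores) if score >= threshold]


# Toy example: 10 documents with hypothetical classifier scores.
docs = list("abcdefghij")
scores = [0.1, 0.9, 0.3, 0.8, 0.2, 0.5, 0.4, 0.7, 0.6, 0.05]
kept = filter_top_fraction(docs, scores, fraction=0.2)
```

Because the comparison is `>=`, tied scores at the threshold are all kept, so the retained share can slightly exceed 20%.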

Positive examples:
  • ELI5 + OH2.5
  • pes2o
  • wiki
  • mmlu
  • stackexchange

Conclusions:
  • Increasing the amount of ELI5 improves performance more than increasing OH2.5
  • pes2o and wiki are especially good on internal_eval
  • stackexchange gives the best performance on eval/bpb, with wiki close behind
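
For readers unfamiliar with the metric, the sketch below assumes eval/bpb means bits per byte: the summed negative log-likelihood of the evaluation text, converted from nats to bits and divided by its UTF-8 byte length, so lower is better. The report does not define the metric, and the function name and inputs here are hypothetical.

```python
import math


def bits_per_byte(total_nll_nats, text):
    """Convert a summed negative log-likelihood (in nats) into bits per UTF-8 byte."""
    num_bytes = len(text.encode("utf-8"))
    if num_bytes == 0:
        raise ValueError("text must be non-empty")
    # Dividing by ln(2) converts nats to bits; dividing by bytes normalizes for length.
    return total_nll_nats / (math.log(2) * num_bytes)


# Toy example: a 4-byte string with a loss of exactly one bit per byte.
example_bpb = bits_per_byte(4 * math.log(2), "abcd")
```

Normalizing by bytes rather than tokens keeps the number comparable across tokenizers.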

