Quality Classifier Comparison
Created on November 20 | Last edited on December 3
Comparing quality classifiers under the following setup:
- Data: 42B FineWeb 18/2024, keeping the top 20% by classifier score
- Model: Llama 1.4B
- Negative samples: 200k RefinedWeb samples from DCLM
Positive examples:
- ELI5 + OH2.5
- peS2o
- wiki
- MMLU
- StackExchange
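
The top-20% cutoff in the setup above can be sketched as follows. This is a minimal illustration, not the pipeline used for these runs: it assumes each document already has a scalar classifier score and that the cutoff is applied per document (the report does not say whether it is per document or per token). The function name `keep_top_fraction` and its parameters are hypothetical.

```python
def keep_top_fraction(docs, scores, fraction=0.2):
    """Keep the documents whose classifier score ranks in the top `fraction`.

    Assumption: higher score means higher predicted quality.
    The kept documents are returned in their original order.
    """
    if len(docs) != len(scores):
        raise ValueError("docs and scores must have the same length")
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")

    # Number of documents to keep, rounded down.
    n_keep = int(len(docs) * fraction)

    # Rank indices by descending score. Python's sort is stable,
    # so documents with equal scores keep their input order.
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)

    # Restore the original document order among the kept indices.
    kept = sorted(ranked[:n_keep])
    return [docs[i] for i in kept]


if __name__ == "__main__":
    docs = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"]
    scores = [0.1, 0.9, 0.3, 0.8, 0.2, 0.5, 0.4, 0.7, 0.6, 0.0]
    print(keep_top_fraction(docs, scores, fraction=0.2))
```

Ranking the whole corpus is the simplest way to hit an exact fraction. On a corpus too large to sort in memory, a score threshold estimated from a sample would be the usual substitute, at the cost of keeping only approximately 20%.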
Conclusions:
- Increasing the amount of ELI5 improves performance more than increasing the amount of OH2.5
- peS2o and wiki do especially well on internal_eval
- StackExchange gives the best performance on eval/bpb, with wiki close behind
Run set: 47