
Fasttext vs BERT

The goal of this experiment is to compare the effectiveness of different types of quality classifier models: in particular, a Fasttext bigram model vs. a BERT classifier (the latter trained via full fine-tuning). To this end, we:
  1. train each type of classifier with positive examples drawn from MMLU and negative examples from the DCLM pool;
  2. score FineWeb Common Crawl documents with each classifier;
  3. filter the top x% for x ∈ {10, 20} (see the sketch below);
  4. and train a 1.4B parameter model on the filtered data.
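As a rough illustration of steps 1–3, here's a minimal sketch of the Fasttext side of the pipeline. It assumes the fastText Python bindings plus placeholder file, label, and variable names (`train.txt`, `__label__positive`, the in-memory `documents` list), none of which come from the actual experiment; the BERT classifier follows the same score-then-threshold pattern, just with a fine-tuned sequence-classification model producing the scores.

```python
import fasttext
import numpy as np

# 1. Train a supervised bigram classifier. Assumes "train.txt" holds one
#    example per line in fastText format, e.g. "__label__positive <text>",
#    with positives drawn from MMLU and negatives from the DCLM pool.
model = fasttext.train_supervised(
    input="train.txt",
    wordNgrams=2,  # bigram features
    epoch=5,
    lr=0.1,
)

# 2. Score documents: probability that a document looks like a positive example.
def quality_score(text: str) -> float:
    # fastText's predict() rejects newlines, so flatten the document first.
    labels, probs = model.predict(text.replace("\n", " "), k=2)
    return dict(zip(labels, probs)).get("__label__positive", 0.0)

# Hypothetical stand-in for a shard of FineWeb Common Crawl documents.
documents = ["first document ...", "second document ..."]
scores = np.array([quality_score(doc) for doc in documents])

# 3. Keep the top x% of documents by classifier score.
x = 20
threshold = np.percentile(scores, 100 - x)
filtered = [doc for doc, s in zip(documents, scores) if s >= threshold]
```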

We evaluate the trained models using bits-per-byte (BPB) on various evaluation sets. Overall, the top 20% of data selected by the BERT model (11504) performs best, outperforming the top 20% selected by the Fasttext model (394bfd) as well as the top 10% from either classifier (BERT: ce2c9b, Fasttext: 0a63ae). For example, on Paloma the BERT-filtered run beats the Fasttext-filtered run by ~0.015 BPB, which is a significant margin. Had we started from a larger pool, filtering down to the top 10% would likely have outperformed the top 20%.
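As a reminder of what these numbers mean: BPB is the model's total negative log-likelihood over an evaluation set, expressed in bits and normalized by the UTF-8 byte count rather than the token count, which makes it comparable across tokenizers:

$$
\mathrm{BPB} = \frac{\sum_i -\log_2 p(x_i \mid x_{<i})}{N_{\text{bytes}}} = \frac{N_{\text{tokens}}}{N_{\text{bytes}}} \cdot \frac{\mathcal{L}_{\text{nats/token}}}{\ln 2}
$$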

[Charts: evaluation BPB curves vs. training step (0–8k) for each run]


BERT is a more powerful model than Fasttext; since we have the compute to run BERT inference at Common Crawl scale, there's no compelling reason to use Fasttext models. That said, we could always repeat this experiment with different sources of positive examples if we want to revisit the comparison.