
Fasttext vs BERT

The goal of this experiment is to compare the effectiveness of different types of quality classifier models: in particular, a Fasttext bigram model vs. a BERT classifier (the latter trained via full fine-tuning). To this end, we:
  1. train each type of classifier with positive examples drawn from MMLU and negative examples from the DCLM pool;
  2. score FineWeb Common Crawl documents with each classifier;
  3. filter the top x% for x ∈ {10, 20} (see the sketch below);
  4. and train a 1.4B parameter model on the filtered data.
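As a rough illustration of steps 1–3, here's a minimal sketch of the Fasttext side of the pipeline. It assumes the fastText Python bindings plus placeholder file, label, and variable names (`train.txt`, `__label__positive`, the in-memory `documents` list), none of which come from the actual experiment; the BERT classifier follows the same score-then-threshold pattern, just with a fine-tuned sequence-classification model producing the scores.

```python
import fasttext
import numpy as np

# 1. Train a supervised bigram classifier. Assumes "train.txt" holds one
#    example per line in fastText format, e.g. "__label__positive <text>",
#    with positives drawn from MMLU and negatives from the DCLM pool.
model = fasttext.train_supervised(
    input="train.txt",
    wordNgrams=2,  # bigram features
    epoch=5,
    lr=0.1,
)

# 2. Score documents: probability that a document looks like a positive example.
def quality_score(text: str) -> float:
    # fastText's predict() rejects newlines, so flatten the document first.
    labels, probs = model.predict(text.replace("\n", " "), k=2)
    return dict(zip(labels, probs)).get("__label__positive", 0.0)

# Hypothetical stand-in for a shard of FineWeb Common Crawl documents.
documents = ["first document ...", "second document ..."]
scores = np.array([quality_score(doc) for doc in documents])

# 3. Keep the top x% of documents by classifier score.
x = 20
threshold = np.percentile(scores, 100 - x)
filtered = [doc for doc, s in zip(documents, scores) if s >= threshold]
```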

We evaluate the trained models using bits-per-byte (BPB) on various evaluation sets. Overall, the top 20% of data selected by the BERT model (11504) performs best, outperforming the top 20% selected by the Fasttext model (394bfd) as well as the top 10% from either classifier (BERT: ce2c9b, Fasttext: 0a63ae). For example, on Paloma the BERT-filtered run beats the Fasttext-filtered run by ~0.015 BPB, which is a significant margin. Had we started from a larger pool, filtering down to the top 10% would likely have outperformed the top 20%.
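As a reminder of what these numbers mean: BPB is the model's total negative log-likelihood over an evaluation set, expressed in bits and normalized by the UTF-8 byte count rather than the token count, which makes it comparable across tokenizers:

$$
\mathrm{BPB} = \frac{\sum_i -\log_2 p(x_i \mid x_{<i})}{N_{\text{bytes}}} = \frac{N_{\text{tokens}}}{N_{\text{bytes}}} \cdot \frac{\mathcal{L}_{\text{nats/token}}}{\ln 2}
$$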

[Charts: evaluation BPB curves vs. training step (0–8k) for each run]


BERT is a more powerful model than Fasttext; since we have the compute to run BERT inference at Common Crawl scale, there's no compelling reason to use Fasttext models. That said, we could always repeat this experiment with different sources of positive examples if we want to revisit the comparison.