
Reproducing FineMath

We train several models on various levels of FineMath reproduction:
  1. Finemath-DCLM: filtering the DCLM dataset with the released FineMath classifier.
  2. Finemath-replication: training our own FineMath classifier with the same annotation prompt FineMath used, then filtering the DCLM dataset with it (a training sketch follows this list).
  3. Finemath-3-plus: training on the released FineMath 3+ dataset directly.
  4. Finemath-cascade-phase2: a two-stage cascade. We first train a FineMath classifier on 500K prompts scored with the three-point rubric and use it to filter the data pool from 400B tokens down to 40B tokens; we then train a second classifier on 1M prompts scored with the five-point rubric and use it to filter from 40B tokens down to 10B tokens (a filtering sketch follows this list).

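Below is a minimal sketch of how a FineMath-style classifier could be trained (pipelines 2 and 4), assuming we already have (text, score) pairs produced by prompting an LLM with the FineMath rubric. The `all-MiniLM-L6-v2` encoder and ridge regression are illustrative stand-ins, not the actual setup used for these runs.

```python
# Illustrative sketch: train a rubric-score regressor on LLM-annotated pairs.
# Assumes `texts` and `scores` come from prompting an LLM with the FineMath
# rubric; the encoder and regressor choices here are hypothetical.
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import Ridge

texts = [
    "Proof: assume sqrt(2) = p/q in lowest terms ...",  # math-heavy document
    "Buy cheap shoes online today ...",                  # non-math document
]
scores = [3.0, 0.0]  # rubric scores assigned by the annotating LLM

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative encoder
X = embedder.encode(texts)

regressor = Ridge(alpha=1.0)
regressor.fit(X, scores)

def score_document(text: str) -> float:
    """Predict a rubric score for an unseen document."""
    return float(regressor.predict(embedder.encode([text]))[0])
```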
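And a minimal sketch of the two-stage cascade itself (pipeline 4), assuming two already-trained scorers, `score_3pt` and `score_5pt`; the thresholds are hypothetical, chosen only to mirror the 400B -> 40B -> 10B reduction described above.

```python
# Illustrative sketch of cascaded filtering: a cheap coarse pass followed by
# a finer pass over the survivors. Scorers and thresholds are hypothetical.
from typing import Callable, Iterable, Iterator

def cascade_filter(
    docs: Iterable[str],
    score_3pt: Callable[[str], float],  # stage 1: three-point-rubric scorer
    score_5pt: Callable[[str], float],  # stage 2: five-point-rubric scorer
    stage1_threshold: float = 2.0,      # ~10x pruning (400B -> 40B tokens)
    stage2_threshold: float = 3.0,      # ~4x pruning (40B -> 10B tokens)
) -> Iterator[str]:
    """Yield documents that pass both stages of the cascade."""
    for doc in docs:
        if score_3pt(doc) < stage1_threshold:
            continue  # dropped by the coarse first stage
        if score_5pt(doc) >= stage2_threshold:
            yield doc  # survives both filters
```

The benefit of the cascade is that the more expensive five-point scorer only ever sees the roughly 10% of documents that survive the cheap first pass.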
Results: The Finemath-3-plus run performs best, particularly on the GSM8K benchmark. We expect further gains from additional crawling, perplexity filtering, and some form of LaTeX detection like that used in OpenWebMath. We can match the FineMath classifier that Hugging Face trained, and even exceed it with a cascaded classification system. All of these annealing runs outperform the DCLM baseline dataset itself.