
Different Data Ablations on Cooldown Annealing

Created on March 27 | Last edited on April 17


Introduction

We explore the effects of QA data during cooldown annealing. Specifically, we look at how QA data affects benchmarks that are (1) related to the QA data and (2) unrelated to it, by comparing high-quality QA data targeted at a specific benchmark against FLAN.

Experiments

We anneal the llama-8b tootsie model from the 660k-step checkpoint. We want to observe the effect of running MEDU on the science subsets of MMLU and see whether it produces a significant change in accuracy.
The following data mixes were used to train each model (a sketch of how a mix translates into token counts appears after the list):
  1. medu-control (green): 100% DCLM Baseline
  2. medu-control-w-flan (purple): 85% DCLM Baseline, 15% FLAN
  3. 8b-quality-noflan-tokenized-medu-candidate-mmlu-science (blue): 70% DCLM Baseline, 30% MEDU-MMLU-Science Filtered
  4. 8b-quality-eval-tokenized-medu-candidate-mmlu-science-ab05f9 (yellow): 70% DCLM Baseline, 15% MEDU-MMLU-Science-Filtered, 15% FLAN
  5. 8b-dclm-70-og-15-qa-15-50b-tokenized-medu-candidate-mmlu-science-ba9a29 (light green): 70% DCLM Baseline, 15% MEDU-MMLU Science Filtered in QA form, 15% MEDU-MMLU-Science Filtered
  6. 8b-dclm-70-qa-30-50b-tokenized-medu-candidate-mmlu-science-ll-qa-7d2930: 70% DCLM Baseline, 30% MEDU-MMLU Science Filtered in QA form.
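For reference, the sketch below shows how one of these mixes translates into absolute token counts over the 50B-token anneal. The dataset keys and helper function are illustrative, not the actual training configuration:

```python
# Hypothetical sketch: mixture weights are fractions of the 50B-token cooldown
# budget drawn from each source. Dataset keys are placeholders, not config names.
ANNEAL_TOKENS = 50_000_000_000

mixtures = {
    "medu-control": {"dclm_baseline": 1.00},
    "medu-control-w-flan": {"dclm_baseline": 0.85, "flan": 0.15},
    "mmlu-science-noflan": {"dclm_baseline": 0.70, "medu_mmlu_science": 0.30},
    "mmlu-science-w-flan": {"dclm_baseline": 0.70, "medu_mmlu_science": 0.15, "flan": 0.15},
}

def tokens_per_source(mix: dict[str, float], budget: int = ANNEAL_TOKENS) -> dict[str, int]:
    """Convert mixture weights into absolute token counts for the cooldown run."""
    assert abs(sum(mix.values()) - 1.0) < 1e-6, "mixture weights must sum to 1"
    return {name: round(weight * budget) for name, weight in mix.items()}

print(tokens_per_source(mixtures["mmlu-science-noflan"]))
# {'dclm_baseline': 35000000000, 'medu_mmlu_science': 15000000000}
```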

Effects of QA Data on Validation Set

Observations:
We see that the FLAN mix has the lowest bpb across redpajama, wikitext, and paloma. Furthermore, on the redpajama and wikitext subsets, the mixes that do not include any QA data (blue, green) have higher bpb than the runs that do, which suggests these validation sets contain text closer to the QA form. For c4en, the 100% DCLM Baseline control has the lowest bpb of all the mixes.

This seems to indicate that the FLAN data is particularly good at lowering the bpb of the paloma validation sets that resemble Wikipedia text. However, it is not as helpful for text like c4 english or s2orc (science papers), which is what we are targeting in this case for science.
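For context, bpb (bits per byte) normalizes the model's loss by the byte length of the validation text, so comparisons are not distorted by tokenizer differences. A minimal sketch of the conversion, assuming the summed cross-entropy loss of a document is available in nats:

```python
import math

def bits_per_byte(total_loss_nats: float, num_bytes: int) -> float:
    """Convert a document's summed cross-entropy loss (in nats) to bits per byte.

    Dividing by ln(2) converts nats to bits; dividing by the UTF-8 byte count
    normalizes across tokenizers with different compression rates.
    """
    return total_loss_nats / (math.log(2) * num_bytes)

# Example: a 2,000-byte document with a summed loss of 1,200 nats.
print(round(bits_per_byte(1200.0, 2000), 3))  # ~0.866 bits/byte
```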

This set of panels contains runs from a private project, which cannot be shown in this report



Effects of QA Data on Downstream Tasks


On the science subset, we see that the two best results come from the 70/30 QA and 70/15 QA/15 MEDU mixes, which beat the control model by roughly two points. This shows that rewriting the data into QA form can be useful for boosting scores. We also notice that the FLAN mixes performed worse on average than the no-FLAN mixes, which indicates that the science data we used is high quality.
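For illustration, the QA-form data referenced above comes from rewriting filtered documents into question-answer pairs with an LLM. The sketch below shows one way such a rewriting step could look; the prompt and function are hypothetical, not the exact pipeline used for these runs:

```python
from typing import Callable

# Illustrative prompt only: the actual rewriting prompt/model used for these runs
# may differ. `generate` stands in for any LLM text-completion function.
QA_REWRITE_PROMPT = (
    "Rewrite the following passage as a series of question-answer pairs that "
    "cover its key facts. Keep answers short and self-contained.\n\n"
    "Passage:\n{document}\n\nQA pairs:"
)

def rewrite_as_qa(document: str, generate: Callable[[str], str]) -> str:
    """Turn a filtered document into QA-form text via a single LLM call."""
    return generate(QA_REWRITE_PROMPT.format(document=document))
```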



Effects of Longer Epochs


In this section, we examine the effect of increasing the length of the anneal from 50B tokens to 100B. We train on 100B tokens with a 60/40 ratio of DCLM to MEDU-Science-filtered data. Note that the other model was trained with a 70/30 ratio of DCLM to MEDU, which means the 100B model saw 40B specialized tokens while the 50B model saw only 15B.

This means we can reach roughly the same relative performance with about 2.6x fewer specialized tokens (15B vs. 40B), simply by rewriting the data into QA form.

This set of panels contains runs from a private project, which cannot be shown in this report


Effects of Increasing the Percentage of High Quality Data


In the previous sections, we fixed the percentage of the high-quality MEDU-MMLU-filtered data at 30%. We now examine what happens if we increase it to 40%. The conclusion is that increasing the share to 40% leads to only a slight increase in the overall MMLU science score. A potential hypothesis is that there are diminishing returns to adding more high-quality specialized data.

This set of panels contains runs from a private project, which cannot be shown in this report


Comparison between Finemath and MEDU MMLU Mathematics

We seek to compare our dataset to the best available large-scale domain dataset, which is Finemath. We compare three runs:
  1. datashop-candidate-finemath-dclm (pink): 70% DCLM, 30% finemath classifier filtered DCLM
  2. 8b-quality-eval-noflan-30-tokenized-finemath_3_plus (green): 70% DCLM, 30% Finemath
  3. datashop-candidate-mmlu-mathematics (brown): 70% DCLM, 30% MEDU MMLU-mathematics filtered DCLM

From the plots below, we see that Finemath is by far the best dataset for elementary mathematics and high school mathematics. We also see that the Finemath classifier is good at picking up elementary-level mathematics material.
One explanation for why the Finemath dataset performs so well is that it applies the OWM (OpenWebMath) pipeline on top of the quality-filter model, which proves to be very important for filtering out documents that may have scored well but are not related to mathematics.
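As a rough illustration of why a second, topic-aware stage helps, the sketch below drops documents that score well on quality but show no sign of mathematical content. The thresholds, fields, and math heuristic are hypothetical and much cruder than the actual OWM/Finemath pipeline:

```python
import re

# Hypothetical two-stage filter: a quality score alone keeps any well-written
# document; the second stage keeps only documents that look like mathematics.
MATH_PATTERN = re.compile(r"(\$[^$]+\$|\\frac|\\sum|\d+\s*[-+*/=]\s*\d+)")

def keep_document(doc: dict, quality_threshold: float = 0.8) -> bool:
    """Keep a document only if it both scores well and appears math-related."""
    if doc["quality_score"] < quality_threshold:   # stage 1: quality classifier score
        return False
    return bool(MATH_PATTERN.search(doc["text"]))  # stage 2: crude topical check

docs = [
    {"text": "A beautifully written history of jazz.", "quality_score": 0.95},
    {"text": "Solve 3 + 4 = 7 and simplify \\frac{1}{2}.", "quality_score": 0.90},
]
print([keep_document(d) for d in docs])  # [False, True]
```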

We see that the replication (blue), which uses the Finemath prompt to train our classifier, reaches roughly the same performance as the Finemath classifier itself; since both outperform the MEDU MMLU-mathematics run, this leads me to believe that the handcrafted prompt from Finemath is better than the prompt the LLM generated for MEDU mathematics.

Open questions:
  1. It looks like we need another filter or pipeline on top of the existing one to get the "format" or "type" of material that we want.

This set of panels contains runs from a private project, which cannot be shown in this report



Extracting QA Pairs from the Wild


This set of panels contains runs from a private project, which cannot be shown in this report