Comparison of eval metrics for OLMoE data ablations

The goal of this analysis is to understand whether our main in-loop downstream evals -- OLMES core 9 plus MMLU -- are sufficiently sensitive to changes in data mix.
Created on July 10 | Last edited on July 15


Niklas has performed training runs for a 1B model to 2x and 5x chinchilla, comparing two data variants (Niklas, correct me if I'm wrong):

  • newds is (roughly) the DCLM mix
  • oldds is (roughly) OLMo 1.7

The DCLM paper performs 1x chinchilla runs comparing the same two data mixes (again, correct me if wrong), and compares them on their "core" evals. So, we can look at how much the difference in data mix changes our evals vs. theirs. Our evals appear to be much less sensitive to the change in data mix than theirs.

In the table below, the first two columns show the training dataset and chinchilla multiplier. OLMES 9 is the average on the 9 OLMES core tasks. Downstream avg is the average of all in-loop downstream evals. MMLU var is also an average; I didn't report MC eval because it's basically noise. DCLM metrics are approximate; I had to read them off from Table 3 in the paper. Percentages in parentheses are relative changes with respect to the Dolma v1.7 row at the same multiplier.

OLMES 9 and the overall downstream task average don't change much between the two data mixes (+1% to +3.4%). MMLU var moves more (+8% to +9.6%), but still falls short of the roughly +15% change in the DCLM core metric reported in the DCLM paper. Based on this, it seems worth trying to reproduce their evals and confirm that we see a difference this large; if we do, computing their evals in-loop could be really helpful for making data decisions.

For the first two rows, we're comparing Niklas' run at 2x chinchilla to the DCLM run reported at 1x. The OLMES 9, Downstream avg, and MMLU var metrics are reported on Niklas' run, while the DCLM core metrics are read off from Fig. 3 in the paper.

| Dataset | Chinchilla multiplier | OLMES 9 | Downstream avg | MMLU var | DCLM core |
|---|---|---|---|---|---|
| Dolma v1.7 | 2x for Niklas, 1x for DCLM | 47.6 | 39.6 | 28.6 | 26 |
| DCLM | 2x for Niklas, 1x for DCLM | 49 (+3%) | 39.9 (+1%) | 30.9 (+8%) | 30 (+~15%) |
| Dolma v1.7 | 5x | 51.2 | 41 | 30.3 | - |
| DCLM | 5x | 52.2 (+2%) | 42.4 (+3.4%) | 33.2 (+9.6%) | - |
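The percentages in parentheses can be re-derived from the raw scores. The snippet below is a minimal sketch of that arithmetic, assuming each percentage is the relative change of the DCLM-mix score over the Dolma v1.7 score at the same multiplier. The dictionary layout and the names `scores`, `relative_change`, and `changes_for` are illustrative only and are not part of any eval tooling.

```python
# Scores copied from the table in this report. The "DCLM core" values are
# approximate readings from the DCLM paper and come from its 1x runs,
# even though they sit in the "2x" rows here.
scores = {
    "2x": {
        "Dolma v1.7": {"OLMES 9": 47.6, "Downstream avg": 39.6, "MMLU var": 28.6, "DCLM core": 26.0},
        "DCLM": {"OLMES 9": 49.0, "Downstream avg": 39.9, "MMLU var": 30.9, "DCLM core": 30.0},
    },
    "5x": {
        "Dolma v1.7": {"OLMES 9": 51.2, "Downstream avg": 41.0, "MMLU var": 30.3},
        "DCLM": {"OLMES 9": 52.2, "Downstream avg": 42.4, "MMLU var": 33.2},
    },
}


def relative_change(baseline: float, new: float) -> float:
    """Relative change of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline


def changes_for(multiplier: str) -> dict:
    """Relative change per metric, DCLM mix vs. Dolma v1.7, at one multiplier."""
    base = scores[multiplier]["Dolma v1.7"]
    new = scores[multiplier]["DCLM"]
    return {metric: relative_change(base[metric], new[metric]) for metric in base}


for multiplier in scores:
    for metric, pct in changes_for(multiplier).items():
        print(f"{multiplier:>2} {metric:<15} {pct:+.1f}%")
```

Laying the numbers out this way makes the sensitivity gap easy to see: at the first comparison point, the DCLM core metric moves several times more than OLMES 9 or the downstream average.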