Comparison of eval metrics for OLMoE data ablations
Niklas has performed training runs for a 1B model to 2x and 5x chinchilla, comparing two data variants (Niklas, correct me if I'm wrong): `newds` is (roughly) the DCLM mix, and `oldds` is (roughly) the OLMo 1.7 mix (Dolma v1.7).
The DCLM paper performs 1x chinchilla runs comparing the same two data mixes (again, correct me if I'm wrong), and compares them on their "core" evals. So, we can look at how much the difference in data mix changes our evals vs. theirs. It turns out that our evals are apparently much less sensitive to the change in data mix.
In the table below, the first two columns show the training dataset and chinchilla multiplier. OLMES 9 is the average over the 9 OLMES core tasks. Downstream avg is the average of all in-loop downstream evals. MMLU var is also an average; I didn't report the MC eval because it's basically noise. The DCLM metrics are approximate; I had to read them off from Table 3 in the paper.
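To make the column definitions concrete, here's a minimal sketch of how these averages would be computed. The task list is my guess at the 9 OLMES core tasks and the scores are made-up illustrative numbers, not results from these runs.

```python
# Hypothetical per-task accuracies from a single in-loop eval step.
# NOTE: illustrative numbers only; the task list is an assumption about the
# OLMES core set, not taken from the runs discussed here.
olmes_core = {
    "arc_challenge": 0.41, "arc_easy": 0.66, "boolq": 0.65,
    "commonsense_qa": 0.55, "hellaswag": 0.58, "openbookqa": 0.40,
    "piqa": 0.72, "social_iqa": 0.51, "winogrande": 0.59,
}

# "OLMES 9" = unweighted mean over the 9 OLMES core tasks.
olmes_9 = sum(olmes_core.values()) / len(olmes_core)

# "Downstream avg" = unweighted mean over *all* in-loop downstream evals
# (here, just the core tasks plus a stand-in MMLU var score).
all_downstream = {**olmes_core, "mmlu_var": 0.29}
downstream_avg = sum(all_downstream.values()) / len(all_downstream)

print(f"OLMES 9: {100 * olmes_9:.1f}  Downstream avg: {100 * downstream_avg:.1f}")
```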
The OLMES 9 score and the overall downstream average don't change much between the two data mixes. MMLU var changes a bit more, but still less than the difference reported in the DCLM paper. Based on this, it seems worth trying to reproduce their evals and confirm that we see a difference this large; if we do, computing their evals in-loop could be really helpful for making data decisions.
For the first two rows, we're comparing Niklas' run at 2x chinchilla to the DCLM run reported at 1x. The OLMES 9, Downstream avg, and MMLU var metrics are reported on Niklas' run, while the DCLM core metrics are read off from Fig. 3 in the paper.
| Dataset | Chinchilla multiplier | OLMES 9 | Downstream avg | MMLU var | DCLM core |
|---|---|---|---|---|---|
| Dolma v1.7 | 2x for Niklas, 1x for DCLM | 47.6 | 39.6 | 28.6 | 26 |
| DCLM | 2x for Niklas, 1x for DCLM | 49 (+3%) | 39.9 (+1%) | 30.9 (+8%) | 30 (+~15%) |
| Dolma v1.7 | 5x | 51.2 | 41 | 30.3 | - |
| DCLM | 5x | 52.2 (+2%) | 42.4 (+3.4%) | 33.2 (+9.6%) | - |
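For reference, the percentages in parentheses are relative (not absolute) changes of the DCLM-mix row over the matching Dolma v1.7 row. A quick sketch using the 2x-chinchilla numbers copied from the table above:

```python
# Relative change of the DCLM mix over Dolma v1.7, per metric, in percent.
# Values are copied from the 2x-chinchilla rows of the table above.
def rel_change(old: float, new: float) -> float:
    return 100.0 * (new - old) / old

pairs = {
    "OLMES 9": (47.6, 49.0),
    "Downstream avg": (39.6, 39.9),
    "MMLU var": (28.6, 30.9),
    "DCLM core (1x runs from the paper)": (26.0, 30.0),
}

for metric, (old, new) in pairs.items():
    print(f"{metric}: {rel_change(old, new):+.1f}%")
# -> OLMES 9: +2.9%, Downstream avg: +0.8%, MMLU var: +8.0%, DCLM core: +15.4%
```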