R1-Zero-like training for math

Context: https://huggingface.co/spaces/open-r1/README/discussions/20

Baseline experiments

We use a lightly preprocessed variant of SynthLabsAI/Big-Math-RL-Verified; the runs below correspond to the following subsets:
  • v00.0X: train on everything
  • v01.0X: train on "medium" difficulty problems, inferred by computing percentiles on the distribution of Llama 8B solve rates
  • v02.0X: train on "hard" difficulty problems, inferred by computing percentiles on the distribution of Llama 8B solve rates
For each run we adjust the number of epochs so that training covers approximately 25k problems in total (a sketch of the subset construction follows below).
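As a rough illustration of how these subsets can be built, here is a minimal sketch using the datasets library. The column name llama8b_solve_rate and the 25th/75th-percentile cutoffs are assumptions for illustration, not necessarily the exact preprocessing used for these runs.

```python
# Minimal sketch of the difficulty filtering. The solve-rate column name
# and the percentile cutoffs are assumptions; check the dataset card.
import numpy as np
from datasets import load_dataset

ds = load_dataset("SynthLabsAI/Big-Math-RL-Verified", split="train")

# Percentiles of the Llama 8B solve-rate distribution define the bands:
# low solve rate = hard, middle band = medium.
rates = np.array(ds["llama8b_solve_rate"], dtype=float)
p25, p75 = np.percentile(rates, [25, 75])

subsets = {
    "v00": ds,                                                           # everything
    "v01": ds.filter(lambda x: p25 <= x["llama8b_solve_rate"] <= p75),   # medium
    "v02": ds.filter(lambda x: x["llama8b_solve_rate"] < p25),           # hard
}

# Scale the epoch count so each run sees roughly 25k problems in total.
for name, subset in subsets.items():
    num_epochs = max(1, round(25_000 / len(subset)))
    print(name, len(subset), num_epochs)
```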

[Three training curves plotted against train/global_step.]
Run sets: All runs (5 runs); Sync ref model vs no sync (2 runs).