R1-Zero-like training for math
Created on April 1|Last edited on April 2
Context: https://huggingface.co/spaces/open-r1/README/discussions/20
Baseline experiments
We train on a lightly preprocessed variant of SynthLabsAI/Big-Math-RL-Verified, where the runs below correspond to the following subsets:
- v00.0X: train on everything
- v01.0X: train on "medium"-difficulty problems, inferred by computing percentiles over the distribution of Llama 8B solve rates
- v02.0X: train on "hard"-difficulty problems, inferred by computing percentiles over the distribution of Llama 8B solve rates
For each run we adjust the number of epochs so that training covers approximately 25k problems in total.
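As a rough sketch of the bucketing and epoch-scaling described above, the following assumes each problem carries a Llama 8B solve rate in [0, 1]; the percentile cut-offs, subset boundaries, and data are illustrative stand-ins, not the actual preprocessing:

```python
import numpy as np

# Stand-in for per-problem Llama 8B solve rates from Big-Math-RL-Verified
# (hypothetical data; the real distribution is far from uniform).
rng = np.random.default_rng(0)
solve_rates = rng.uniform(0.0, 1.0, size=100_000)

# Illustrative percentile cut-offs: lower solve rate = harder problem.
lo, hi = np.percentile(solve_rates, [25, 75])
medium = solve_rates[(solve_rates >= lo) & (solve_rates <= hi)]  # "medium" subset (v01.0X)
hard = solve_rates[solve_rates < lo]                             # "hard" subset (v02.0X)

# Scale epochs so each run sees roughly 25k problems in total.
target = 25_000
epochs_medium = max(1, round(target / len(medium)))
epochs_hard = max(1, round(target / len(hard)))
print(len(medium), epochs_medium, len(hard), epochs_hard)
```

The point of the epoch scaling is that smaller subsets (e.g. "hard") are repeated for more epochs, so every run consumes a comparable training budget.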
Sync ref model vs no sync