R1-Zero-like training for math

Context: https://huggingface.co/spaces/open-r1/README/discussions/20

Baseline experiments

We use a lightly preprocessed variant of SynthLabsAI/Big-Math-RL-Verified; the runs below correspond to the following subsets:
  • v00.0X: train on everything
  • v01.0X: train on "medium" difficulty problems, inferred by computing percentiles on the distribution of Llama 8B solve rates
  • v02.0X: train on "hard" difficulty problems, inferred by computing percentiles on the distribution of Llama 8B solve rates
For each run we adjust the number of epochs so that training covers approximately 25k problems in total (a sketch of the subset construction follows below).
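As a rough illustration of how these subsets can be built, here is a minimal sketch using the datasets library. The column name llama8b_solve_rate and the 25th/75th-percentile cutoffs are assumptions for illustration, not necessarily the exact preprocessing used for these runs.

```python
# Minimal sketch of the difficulty filtering. The solve-rate column name
# and the percentile cutoffs are assumptions; check the dataset card.
import numpy as np
from datasets import load_dataset

ds = load_dataset("SynthLabsAI/Big-Math-RL-Verified", split="train")

# Percentiles of the Llama 8B solve-rate distribution define the bands:
# low solve rate = hard, middle band = medium.
rates = np.array(ds["llama8b_solve_rate"], dtype=float)
p25, p75 = np.percentile(rates, [25, 75])

subsets = {
    "v00": ds,                                                           # everything
    "v01": ds.filter(lambda x: p25 <= x["llama8b_solve_rate"] <= p75),   # medium
    "v02": ds.filter(lambda x: x["llama8b_solve_rate"] < p25),           # hard
}

# Scale the epoch count so each run sees roughly 25k problems in total.
for name, subset in subsets.items():
    num_epochs = max(1, round(25_000 / len(subset)))
    print(name, len(subset), num_epochs)
```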

[Three training curves plotted against train/global_step.]
Run sets: All runs (5 runs); Sync ref model vs no sync (2 runs).