
Logs of PURE

Records and analysis of PURE's logs
Created on February 9 | Last edited on February 23
The "outcome reward" is calculated as the weighted sum of process rewards. "solved" indicates whether the generated answers match the ground-truth answers. "Match" checks if the outcome reward aligns with the ground-truth reward: it’s true if the outcome reward > 0 and the answer is solved, or if the outcome reward ≤ 0 and the answer isn’t solved. This metric evaluates whether the PRM-derived outcome rewards correctly reflect the outcome-level ground-truth. Note that:
  1. When using only verifiable rewards, process rewards are 0, making the outcome rewards 0 as well. In this case, the "outcome reward" and "match" metrics should be ignored.
  2. When using only process rewards, the "solved" metric is only used for monitoring and not for training.
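
A minimal sketch of how these logged quantities relate, in Python. The function names and the per-step weights passed in are assumptions for illustration, not the actual PURE implementation:

```python
# Minimal sketch (hypothetical names; the per-step weights are an assumption)
# of how the logged "outcome reward", "solved", and "match" metrics relate.

from typing import Sequence

def outcome_reward(process_rewards: Sequence[float],
                   weights: Sequence[float]) -> float:
    """Weighted sum of the PRM's process rewards."""
    return sum(w * r for w, r in zip(weights, process_rewards))

def match(outcome: float, solved: bool) -> bool:
    """True when the sign of the outcome reward agrees with the ground truth."""
    return (outcome > 0 and solved) or (outcome <= 0 and not solved)
```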

Main experiments on Qwen2.5-Math-7B


[Figure: three training-curve panels over steps 0-500, each comparing Qwen2.5-PURE-VR, Qwen2.5-PURE-PRM, and Qwen2.5-PURE-PRM+VR.]


The verifiable reward on the training set does not determine the benchmark score. Although the verifiable reward of Qwen2.5-PURE-VR rises steadily and is always the highest among the three models, it reaches an overall accuracy of only 48.3%, lower than the 49.3% and 53.3% achieved by Qwen2.5-PURE-PRM and Qwen2.5-PURE-PRM+VR, respectively.

Ablation of credit assignment



Because gamma-decay credit assignment (with $\gamma=1$) computes the return as $R=r_1+\cdots+r_n$, while min-form credit assignment computes it as $R=\min(r_1,\cdots,r_n)$, the outcome-reward values of the two methods are not directly comparable.
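
As a quick illustration (Python, with made-up per-step process rewards), the two returns defined above behave very differently on the same trajectory, which is why their magnitudes cannot be compared directly:

```python
# Gamma-decay vs. min-form credit assignment over the same per-step process rewards.

def gamma_decay_return(process_rewards, gamma=1.0):
    # Discounted sum; with gamma = 1 this is simply r_1 + ... + r_n.
    return sum(gamma ** t * r for t, r in enumerate(process_rewards))

def min_form_return(process_rewards):
    # Min-form: the return is the worst step's reward.
    return min(process_rewards)

rewards = [0.9, 0.8, -0.5, 0.7]      # made-up values for illustration
print(gamma_decay_return(rewards))    # 1.9  -> grows with the number of steps
print(min_form_return(rewards))       # -0.5 -> bounded by the worst step
```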
The gamma-decay credit assignment method quickly leads to reward hacking. At step 30, all metrics change drastically. By step 80, the model has collapsed: its average benchmark score is around 30, lower than the base model's 39.5, and it continues to decrease. By step 300, the model hacks the process rewards, so the outcome reward increases while the verifiable reward converges to 0.
Compared with the gamma-decay results, our min-form credit assignment is essential for stable training with the PRM, yielding SOTA-level models, Qwen2.5-PURE-PRM and Qwen2.5-PURE-PRM+VR, with average scores of 49.3 and 53.3, respectively.
It is worth noting that when we increase the number of ground-truth answers from 800 to 8k, we can also stabilize training with gamma-decay credit assignment and obtain a good model with 51.2 overall accuracy.

Ablation of data for PRM+VR



| Dataset size | Number of ground-truth answers | AIME 2024 | MATH 500 | AMC | Minerva Math | OlympiadBench | Avg. |
|---|---|---|---|---|---|---|---|
| 8k | 8k | 16.7 | 81.4 | 72.5 | 38.6 | 45.5 | 50.9 |
| 8k | 800 | 20.0 | 82.6 | 82.5 | 37.1 | 44.1 | 53.3 |
| 8k | 80 | 13.3 | 82.8 | 65.0 | 37.1 | 43.1 | 48.3 |
| 80k | 8k | 20.0 | 81.6 | 67.5 | 37.1 | 41.0 | 49.4 |

More data or more ground-truth answers does not necessarily lead to better results.

Ablation on RLOO baselines



The step-level baseline prefers answers with fewer steps; see this issue for an intuitive example. This causes the model to hack the process rewards, producing answers with fewer steps but excessively many tokens per step (see the green line, which converges to 1 step per answer).
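
For reference, below is a minimal sketch of the standard response-level leave-one-out (RLOO) baseline against which the step-level variant is compared; the exact step-level formulation follows the linked issue, and all names here are illustrative:

```python
# Minimal sketch (illustrative names): the response-level RLOO baseline.
# For each of k sampled responses, the baseline is the mean return of the other
# k - 1 responses, so the advantage does not depend on a response's step count.

from typing import List

def rloo_advantages(returns: List[float]) -> List[float]:
    """Leave-one-out advantages for a group of k sampled responses."""
    k = len(returns)
    total = sum(returns)
    return [r - (total - r) / (k - 1) for r in returns]

# Example: four sampled responses with returns 1.0, 0.0, 0.5, 0.5
print(rloo_advantages([1.0, 0.0, 0.5, 0.5]))
```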

No Aha Moment




Attempts on DeepSeek-R1-Distill-Qwen-1.5B