Logs of PURE
Records and analysis of PURE's logs
Created on February 9 | Last edited on February 23
The "outcome reward" is calculated as the weighted sum of process rewards. "solved" indicates whether the generated answers match the ground-truth answers. "Match" checks if the outcome reward aligns with the ground-truth reward: it’s true if the outcome reward > 0 and the answer is solved, or if the outcome reward ≤ 0 and the answer isn’t solved. This metric evaluates whether the PRM-derived outcome rewards correctly reflect the outcome-level ground-truth. Note that:
- When using only verifiable rewards, process rewards are 0, making the outcome rewards 0 as well. In this case, the "outcome reward" and "match" metrics should be ignored.
- When using only process rewards, the "solved" metric is only used for monitoring and not for training.
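For concreteness, here is a minimal sketch of how these logged quantities could be computed for a single response; the function and variable names are illustrative assumptions, not the actual logging code.

```python
import numpy as np

def outcome_metrics(process_rewards, weights, solved):
    """Illustrative computation of the logged metrics for one response.

    process_rewards: per-step rewards from the PRM.
    weights:         per-step weights used to aggregate them.
    solved:          whether the generated answer matches the ground-truth answer.
    """
    # "Outcome reward": weighted sum of the process rewards.
    outcome_reward = float(np.dot(weights, process_rewards))

    # "Match": the PRM-derived outcome reward agrees with the ground truth when
    # it is > 0 for solved answers and <= 0 for unsolved ones.
    match = (outcome_reward > 0) == solved

    return outcome_reward, solved, match

# Example: a correct 3-step answer with uniform aggregation weights (hypothetical values).
print(outcome_metrics([0.4, 0.1, 0.6], [1 / 3] * 3, solved=True))
# -> (0.366..., True, True): the outcome reward is > 0 and the answer is solved, so they match.
```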
Main experiments on Qwen2.5-Math-7B
The verifiable reward on the training set does not determine the score on the benchmarks. Although the verifiable rewards of Qwen2.5-PURE-VR rise steadily and remain the highest among the three models, it reaches an overall accuracy of only 48.3%, lower than the 49.3% and 53.3% achieved by Qwen2.5-PURE-PRM and Qwen2.5-PURE-PRM+VR, respectively.
Ablation of credit assignment
Because gamma-decay (if $\gamma = 1$) and min-form credit assignment calculate the return as $R_t = \sum_{i \ge t} r_i$ and $R_t = \min_{i \ge t} r_i$, respectively, the outcome rewards of the two methods are not comparable.
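As a minimal sketch of the two return computations (assuming a list of per-step process rewards for one response, and $\gamma = 1$ for the gamma-decay case as above):

```python
def gamma_decay_return(process_rewards, gamma=1.0):
    """Summation-form return: R_t = sum_{i >= t} gamma^(i-t) * r_i."""
    returns, running = [], 0.0
    for r in reversed(process_rewards):
        running = r + gamma * running
        returns.append(running)
    return list(reversed(returns))

def min_form_return(process_rewards):
    """Min-form return: R_t = min_{i >= t} r_i."""
    returns, running = [], float("inf")
    for r in reversed(process_rewards):
        running = min(r, running)
        returns.append(running)
    return list(reversed(returns))

# Hypothetical per-step process rewards for one response.
rewards = [0.5, -0.2, 0.3]
print(gamma_decay_return(rewards))  # [0.6, 0.1, 0.3]  -- sums grow with the number of remaining steps
print(min_form_return(rewards))     # [-0.2, -0.2, 0.3] -- bounded by the worst remaining step
```

The summation-form return scales with the number of remaining steps, while the min-form return is bounded by the worst remaining step, which is why the two are not on the same scale.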
The gamma-decay credit assignment method quickly leads to reward hacking. At step 30, all metrics change drastically. At step 80, the model has collapsed: its average benchmark score is around 30, below the 39.5 of the base model, and it continues to decrease. At step 300, the model hacks the process rewards, so the outcome reward increases while the verifiable rewards converge to 0.
Compared with gamma-decay, our min-form credit assignment is crucial for stabilizing training with the PRM and yields the SOTA-level Qwen2.5-PURE-PRM and Qwen2.5-PURE-PRM+VR models, with average scores of 49.3 and 53.3, respectively.
It is worth noting that when we increase the number of ground-truth answers from 800 to 8k, we can also stabilize training with gamma-decay credit assignment and obtain a good model with 51.2% overall accuracy.
Ablation of data for PRM+VR
| Dataset size | Number of ground-truth answers | AIME 2024 | MATH 500 | AMC | Minerva Math | OlympiadBench | Avg. |
|---|---|---|---|---|---|---|---|
| 8k | 8k | 16.7 | 81.4 | 72.5 | 38.6 | 45.5 | 50.9 |
| 8k | 800 | 20.0 | 82.6 | 82.5 | 37.1 | 44.1 | 53.3 |
| 8k | 80 | 13.3 | 82.8 | 65.0 | 37.1 | 43.1 | 48.3 |
| 80k | 8k | 20.0 | 81.6 | 67.5 | 37.1 | 41.0 | 49.4 |
More data or more ground-truth answers does not necessarily lead to better results.
Ablation on RLOO baselines
The step-level baseline prefers answers with fewer steps. See this issue for an intuitive example. This causes the model to hack the process rewards, producing answers with fewer steps but excessively many tokens per step (see the green line, which converges to 1 step per answer).
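For reference, here is a minimal sketch of the contrasting response-level RLOO baseline (a leave-one-out mean over the other responses sampled for the same prompt), assuming each response is summarized by a single scalar return; the function name and numbers are illustrative, not the training code. The step-level variant presumably applies the leave-one-out baseline per step instead, which is what introduces the length bias described above.

```python
import numpy as np

def rloo_advantages(returns):
    """Response-level RLOO: each response's baseline is the mean return
    of the other K-1 responses sampled for the same prompt."""
    returns = np.asarray(returns, dtype=float)
    k = len(returns)
    loo_baseline = (returns.sum() - returns) / (k - 1)  # leave-one-out mean
    return returns - loo_baseline

# K = 4 responses to one prompt, each summarized by one scalar return
# (e.g., its outcome reward). Values are made up for illustration.
print(rloo_advantages([0.8, 0.2, -0.1, 0.5]))
# -> [ 0.6 -0.2 -0.6  0.2]
```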
No Aha Moment
Attempts on DeepSeek-R1-Distill-Qwen-1.5B