#2 Fine-tuning Qwen2.5-Math-1.5B with REINFORCE++
As before, we use the Qwen2.5-Math-1.5B base model as the initial policy. The training codebase is built on OpenRLHF, with reward functions adopted from the huggingface/open-r1 project. We first tried the open-r1 evaluation protocol used previously, but noticed unexpected LLM completions on quite a few test samples. We therefore switched to the evaluation script provided at https://github.com/hkust-nlp/simpleRL-reason, which builds on Qwen Math's evaluation codebase but, for fairness, completely prohibits solving problems by calling code.
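
For readers unfamiliar with REINFORCE++: it is a critic-free variant of REINFORCE that borrows PPO-style stabilization tricks, folding a token-level KL penalty against the reference model into the reward and normalizing the resulting returns across the global batch instead of learning a value baseline. The PyTorch sketch below illustrates only that advantage computation; it is not the OpenRLHF implementation, and the function name, tensor shapes, and default coefficients are illustrative assumptions.

```python
# Minimal sketch of a REINFORCE++-style advantage computation (illustrative,
# not the OpenRLHF code). Assumes right-padded response tensors of shape (B, T).
import torch


def reinforce_pp_advantages(
    rewards: torch.Tensor,        # (B,)   scalar outcome reward per response
    log_probs: torch.Tensor,      # (B, T) response-token log-probs under the policy
    ref_log_probs: torch.Tensor,  # (B, T) response-token log-probs under the frozen reference
    mask: torch.Tensor,           # (B, T) 1 for real response tokens, 0 for padding
    kl_coef: float = 0.01,        # KL penalty coefficient (assumed value)
    gamma: float = 1.0,           # discount factor; 1.0 gives plain reward-to-go
) -> torch.Tensor:
    # Token-level KL penalty (k1 estimator) enters as a negative per-token reward.
    token_rewards = -kl_coef * (log_probs - ref_log_probs) * mask
    # The scalar outcome reward is attached to the last real token of each response.
    last_idx = mask.sum(dim=1).long().clamp(min=1) - 1
    token_rewards[torch.arange(rewards.size(0)), last_idx] += rewards

    # Reward-to-go for every token (no learned critic).
    returns = torch.zeros_like(token_rewards)
    running = torch.zeros_like(rewards)
    for t in reversed(range(token_rewards.size(1))):
        running = token_rewards[:, t] + gamma * running * mask[:, t]
        returns[:, t] = running

    # Global batch normalization of the returns acts as the baseline.
    valid = returns[mask.bool()]
    advantages = (returns - valid.mean()) / (valid.std() + 1e-8)
    return advantages * mask
```

With `gamma = 1.0` this reduces, for each token, to the outcome reward minus the accumulated future KL penalty, and the whitening over the global batch plays the role that a learned value function would in PPO; the resulting advantages are then used in a PPO-style clipped policy-gradient loss.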
Evaluation results (each step corresponds to 64 training prompts):
| Model | Steps | MATH-500 |
|---|---|---|
| Qwen2.5-Math-1.5B | n/a | 0.350 |
| Qwen2.5-Math-1.5B-Instruct | n/a | 0.750 |
| Qwen2.5-Math-1.5B w/ REINFORCE++ | 16 | 0.612 |
| | 32 | 0.668 |
| | 48 | 0.712 |
| | 64 | 0.726 |
| | 80 | 0.720 |
| | 96 | 0.746 |
| | 112 | 0.758 |
Section 1
[W&B chart panels for run: Qwen2.5-Math-1.5B_REINFORCE_BASELINE_LARGE_BS]