
#2 Fine-tuning Qwen2.5-Math-1.5B with REINFORCE++

As before, we use the Qwen2.5-Math-1.5B base model as the initial policy model. The training codebase is built upon OpenRLHF, with reward functions adopted from the huggingface/open-r1 project. We attempted to reuse the previous evaluation protocol from open-r1, but noticed unexpected LLM completions for quite a few test samples. We therefore switched to the evaluation script provided in https://github.com/hkust-nlp/simpleRL-reason, which uses Qwen Math's codebase for evaluation but completely prohibits solving problems by calling code, for fairness of comparison.
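
For context, the reward functions adopted from open-r1 are rule-based: a completion is scored by whether its final boxed answer matches the reference answer. The sketch below is a deliberately simplified illustration (plain string matching instead of the symbolic verification used in open-r1, and with function names of our own), not the code we actually ran:

```python
import re

def extract_boxed_answer(text: str) -> str | None:
    """Return the content of the last \\boxed{...} in a completion, if any."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None

def accuracy_reward(completion: str, gold_answer: str) -> float:
    """Binary rule-based reward: 1.0 if the boxed answer matches the gold
    answer after light normalization, 0.0 otherwise. The real open-r1
    verifier does symbolic checking; string matching here is a simplification."""
    predicted = extract_boxed_answer(completion)
    if predicted is None:
        return 0.0
    normalize = lambda s: s.replace(" ", "").rstrip(".")
    return 1.0 if normalize(predicted) == normalize(gold_answer) else 0.0

# A completion whose final boxed answer matches the reference gets reward 1.0.
print(accuracy_reward(r"... so the answer is \boxed{42}.", "42"))  # 1.0
```

Likewise, a rough sketch of the REINFORCE++-style update: instead of a learned critic, token-level returns (the sequence reward placed on the last response token, plus a per-token KL penalty against the reference policy) are normalized with global batch statistics. This is our simplified reading of the method, assuming responses are right-padded with the prompt already stripped; the actual OpenRLHF implementation differs in details such as the KL estimator and reward clipping.

```python
import torch

def reinforce_pp_advantages(rewards: torch.Tensor,
                            kl: torch.Tensor,
                            mask: torch.Tensor,
                            kl_coef: float = 0.01,
                            eps: float = 1e-8) -> torch.Tensor:
    """Token-level advantages in the spirit of REINFORCE++.

    rewards: (batch,)          scalar reward per sequence (e.g. 0/1 accuracy)
    kl:      (batch, seq_len)  per-token KL vs. the frozen reference policy
    mask:    (batch, seq_len)  1.0 for response tokens, 0.0 for padding
    """
    batch = rewards.shape[0]
    # per-token KL penalty on response tokens
    token_rewards = -kl_coef * kl * mask
    # place the scalar sequence reward on each sequence's last response token
    lengths = mask.sum(dim=1).long()
    token_rewards[torch.arange(batch), lengths - 1] += rewards
    # reward-to-go: cumulative sum from the end of the sequence backwards
    returns = token_rewards.flip(dims=[1]).cumsum(dim=1).flip(dims=[1]) * mask
    # global batch normalization in place of a learned value baseline
    valid = returns[mask.bool()]
    return (returns - valid.mean()) / (valid.std() + eps) * mask
```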

Evaluation results (each step corresponds to 64 training prompts):
| Model | Steps | MATH-500 |
| --- | --- | --- |
| Qwen2.5-Math-1.5B | n/a | 0.350 |
| Qwen2.5-Math-1.5B-Instruct | n/a | 0.750 |
| Qwen2.5-Math-1.5B w/ REINFORCE++ | 16 | 0.612 |
| | 32 | 0.668 |
| | 48 | 0.712 |
| | 64 | 0.726 |
| | 80 | 0.720 |
| | 96 | 0.746 |
| | 112 | 0.758 |



Section 1

[W&B training curves for run: Qwen2.5-Math-1.5B_REINFORCE_BASELINE_LARGE_BS]