
#2 Fine-tuning Qwen2.5-Math-1.5B with REINFORCE++

As before, we use the Qwen2.5-Math-1.5B base model as the initial policy model. The training codebase is built upon OpenRLHF, with reward functions adopted from the huggingface/open-r1 project. We attempted to reuse the previous evaluation protocol from open-r1, but noticed unexpected LLM completions for quite a few test samples. We therefore switched to the evaluation script provided in https://github.com/hkust-nlp/simpleRL-reason, which uses Qwen Math's codebase for evaluation but completely prohibits solving problems by calling code, for fairness of comparison.
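
For context, the reward functions adopted from open-r1 are rule-based: a completion is scored by whether its final boxed answer matches the reference answer. The sketch below is a deliberately simplified illustration (plain string matching instead of the symbolic verification used in open-r1, and with function names of our own), not the code we actually ran:

```python
import re

def extract_boxed_answer(text: str) -> str | None:
    """Return the content of the last \\boxed{...} in a completion, if any."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None

def accuracy_reward(completion: str, gold_answer: str) -> float:
    """Binary rule-based reward: 1.0 if the boxed answer matches the gold
    answer after light normalization, 0.0 otherwise. The real open-r1
    verifier does symbolic checking; string matching here is a simplification."""
    predicted = extract_boxed_answer(completion)
    if predicted is None:
        return 0.0
    normalize = lambda s: s.replace(" ", "").rstrip(".")
    return 1.0 if normalize(predicted) == normalize(gold_answer) else 0.0

# A completion whose final boxed answer matches the reference gets reward 1.0.
print(accuracy_reward(r"... so the answer is \boxed{42}.", "42"))  # 1.0
```

Likewise, a rough sketch of the REINFORCE++-style update: instead of a learned critic, token-level returns (the sequence reward placed on the last response token, plus a per-token KL penalty against the reference policy) are normalized with global batch statistics. This is our simplified reading of the method, assuming responses are right-padded with the prompt already stripped; the actual OpenRLHF implementation differs in details such as the KL estimator and reward clipping.

```python
import torch

def reinforce_pp_advantages(rewards: torch.Tensor,
                            kl: torch.Tensor,
                            mask: torch.Tensor,
                            kl_coef: float = 0.01,
                            eps: float = 1e-8) -> torch.Tensor:
    """Token-level advantages in the spirit of REINFORCE++.

    rewards: (batch,)          scalar reward per sequence (e.g. 0/1 accuracy)
    kl:      (batch, seq_len)  per-token KL vs. the frozen reference policy
    mask:    (batch, seq_len)  1.0 for response tokens, 0.0 for padding
    """
    batch = rewards.shape[0]
    # per-token KL penalty on response tokens
    token_rewards = -kl_coef * kl * mask
    # place the scalar sequence reward on each sequence's last response token
    lengths = mask.sum(dim=1).long()
    token_rewards[torch.arange(batch), lengths - 1] += rewards
    # reward-to-go: cumulative sum from the end of the sequence backwards
    returns = token_rewards.flip(dims=[1]).cumsum(dim=1).flip(dims=[1]) * mask
    # global batch normalization in place of a learned value baseline
    valid = returns[mask.bool()]
    return (returns - valid.mean()) / (valid.std() + eps) * mask
```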

Evaluation results (each step corresponds to 64 training prompts):
| Model | Steps | MATH-500 |
| --- | --- | --- |
| Qwen2.5-Math-1.5B | n/a | 0.350 |
| Qwen2.5-Math-1.5B-Instruct | n/a | 0.750 |
| Qwen2.5-Math-1.5B w/ REINFORCE++ | 16 | 0.612 |
| | 32 | 0.668 |
| | 48 | 0.712 |
| | 64 | 0.726 |
| | 80 | 0.720 |
| | 96 | 0.746 |
| | 112 | 0.758 |



Section 1

[W&B training curves for run: Qwen2.5-Math-1.5B_REINFORCE_BASELINE_LARGE_BS]