
Llama3.2-1B Post-Training with GRPO from DeepSeek

Model link: https://huggingface.co/accuracy-maker/Llama-3.2-1B-GRPO-gsm8k
W&B link: https://wandb.ai/accuracy-maker/Llama3.2-1B-GRPO?nw=nwuseraccuracymaker
Hello everyone! I was recently curious about how GRPO works for post-training a base model, so I ran a simple experiment to find out. The experiment is nearly free and reproducible by anyone.
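
For context, GRPO (Group Relative Policy Optimization, introduced in DeepSeek's DeepSeekMath paper) drops PPO's learned value model and instead computes a baseline from a group of samples: for each question, the policy generates $G$ completions, each one is scored by a reward function, and a completion's advantage is its reward normalized within its own group:

$$
\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1, \dots, r_G)}{\operatorname{std}(r_1, \dots, r_G)}
$$

Completions that beat their group's average get reinforced; the rest get pushed down.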

Instance Setup

I used AWS to run the entire experiment. My instance configuration was:
  • Ubuntu 22.04 AMI with PyTorch preinstalled
  • g5.xlarge (4 vCPUs, one NVIDIA A10G GPU with 24 GB VRAM)
  • 45 GB storage
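
Before training, a quick sanity check (a minimal Python sketch) confirms the instance's GPU is visible to PyTorch; on a g5.xlarge you should see a single NVIDIA A10G:

```python
import torch

# Verify the PyTorch build and that the g5.xlarge's GPU is usable.
print(torch.__version__)
print(torch.cuda.is_available())      # expect: True
print(torch.cuda.get_device_name(0))  # expect: "NVIDIA A10G"
```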


Base Model

I used **Llama3.2-1B-Instruct** as my base model since it is small, open-source, and runs comfortably on my instance's limited GPU and CPU.
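
Loading it with Hugging Face transformers looks roughly like the sketch below (note the repo is gated on the Hub, so you need to accept Meta's license first):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.2-1B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # ~2.5 GB of weights, well within the A10G's 24 GB
    device_map="auto",
)
```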


Dataset

The training dataset is openai/gsm8k, a small, clean dataset of grade-school math word problems for mathematical reasoning.
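
Loading it with the datasets library is straightforward. Each example pairs a word problem with step-by-step reasoning that ends in a `#### <final answer>` marker, which makes the ground truth easy to extract for a reward function:

```python
from datasets import load_dataset

dataset = load_dataset("openai/gsm8k", "main")  # ~7.5k train / ~1.3k test examples

example = dataset["train"][0]
print(example["question"])

# Every answer ends with "#### <number>"; split it off to get the reference answer.
final_answer = example["answer"].split("####")[-1].strip()
print(final_answer)
```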


Training Plot

[Interactive W&B panels (a run set of 2 runs) are embedded here; see the W&B project linked above for the training curves.]
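
For reference, here is a minimal sketch of a training script that could produce runs like these. It assumes TRL's GRPOTrainer and uses a toy correctness reward that compares the last number in each completion against the gsm8k reference answer; the exact reward shaping and hyperparameters in my runs may differ.

```python
import re
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# GRPOTrainer expects a "prompt" column; keep the reference answer alongside it.
def preprocess(example):
    return {
        "prompt": example["question"],
        "target": example["answer"].split("####")[-1].strip().replace(",", ""),
    }

train_dataset = load_dataset("openai/gsm8k", "main", split="train").map(preprocess)

# Extra dataset columns (here "target") are forwarded to the reward function as
# kwargs. Reward 1.0 if the last number in the completion matches the reference.
def correctness_reward(completions, target, **kwargs):
    rewards = []
    for completion, ref in zip(completions, target):
        numbers = re.findall(r"-?\d+\.?\d*", completion.replace(",", ""))
        rewards.append(1.0 if numbers and numbers[-1] == ref else 0.0)
    return rewards

training_args = GRPOConfig(
    output_dir="Llama-3.2-1B-GRPO-gsm8k",
    num_generations=8,              # group size G for the relative advantage
    max_completion_length=256,
    per_device_train_batch_size=8,  # effective batch must divide by num_generations
    learning_rate=1e-6,
    bf16=True,
    logging_steps=10,
)

trainer = GRPOTrainer(
    model="meta-llama/Llama-3.2-1B-Instruct",
    reward_funcs=correctness_reward,
    args=training_args,
    train_dataset=train_dataset,
)
trainer.train()
```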