Llama3.2-1B Post-Training with GRPO from DeepSeek
Model link: https://huggingface.co/accuracy-maker/Llama-3.2-1B-GRPO-gsm8k
Wandb link: https://wandb.ai/accuracy-maker/Llama3.2-1B-GRPO?nw=nwuseraccuracymaker
Hello guys, I've recently been curious about how GRPO works for post-training a base model, so I ran a simple experiment to find out. The experiment is nearly free and reproducible by anyone.
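For context, the core idea of GRPO (from the DeepSeekMath paper) is a group-relative advantage: several completions are sampled per prompt, and each completion's reward is normalized against the mean and standard deviation of its own group. Here is a minimal sketch of that computation, my own illustration rather than the exact DeepSeek implementation:

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """Compute GRPO-style advantages.

    rewards: shape (num_prompts, group_size) -- one scalar reward per
    sampled completion, grouped by prompt.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    # Each completion is scored relative to its own group; eps avoids
    # division by zero when all completions in a group tie.
    return (rewards - mean) / (std + eps)

# Example: 2 prompts, 4 sampled completions each (binary correctness rewards)
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.0, 0.0, 0.0, 1.0]])
print(group_relative_advantages(rewards))
```

Because the baseline comes from the group itself, no separate value model is needed, which is what makes GRPO cheap enough for an experiment like this one.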
Instance Setup
I ran the whole experiment on AWS. My instance configuration is below (a quick GPU sanity check follows the list):
- Ubuntu 22.04 with PyTorch
- g5.xlarge (4 vCPUs, 1× NVIDIA A10G GPU)
- 45 GB storage
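Before training, it's worth confirming that PyTorch can actually see the instance's GPU; a quick check like this is all that's needed:

```python
import torch

# Sanity check that the instance's GPU is visible to PyTorch.
print(torch.__version__)
print(torch.cuda.is_available())          # expect True on g5.xlarge
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # expect an NVIDIA A10G
```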
Base Model
I used **Llama3.2-1B-Instruct** as my base model since it is small, open source, and my instance can run it comfortably within limited GPU and CPU resources.
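Loading the model follows the standard transformers API; a minimal sketch is below. Note that the Hub repo meta-llama/Llama-3.2-1B-Instruct is gated, so a Hugging Face token with access is assumed:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.2-1B-Instruct"  # gated repo; access assumed
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # bfloat16 weights fit easily on a single A10G
    device_map="auto",    # requires the accelerate package
)
```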
Dataset
The training dataset is openai/gsm8k, a small, clean dataset of grade-school math word problems for mathematical reasoning.
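GSM8K stores each gold answer after a `####` marker, which makes a simple verifiable correctness reward easy to write. Here is a sketch of loading the data and scoring a completion; the exact prompting and reward shaping used in my run may differ:

```python
import re
from datasets import load_dataset

dataset = load_dataset("openai/gsm8k", "main", split="train")
print(dataset[0]["question"])

def extract_gold_answer(answer_text: str) -> str:
    # GSM8K gold answers end with "#### <final number>".
    return answer_text.split("####")[-1].strip()

def correctness_reward(completion: str, gold: str) -> float:
    # Reward 1.0 if the last number in the completion matches the gold
    # answer, else 0.0 -- a typical GRPO-style verifiable reward.
    numbers = re.findall(r"-?\d+\.?\d*", completion.replace(",", ""))
    return 1.0 if numbers and numbers[-1] == gold else 0.0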
Training Plot
[Interactive W&B panel (run set, 2 runs): training curves are embedded here; see the Wandb link above.]
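For anyone reproducing this, the pieces above wire together roughly as follows with TRL's GRPOTrainer. This is a sketch assuming a recent trl release; the hyperparameters here are illustrative, not necessarily the exact values from my run (full logs are in the Wandb link above):

```python
import re
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# GRPOTrainer expects a "prompt" column; keep the gold answer alongside.
dataset = load_dataset("openai/gsm8k", "main", split="train")
dataset = dataset.map(lambda x: {
    "prompt": x["question"],
    "gold": x["answer"].split("####")[-1].strip(),
})

def correctness_reward(completions, gold, **kwargs):
    # Extra dataset columns ("gold") are forwarded as keyword arguments.
    rewards = []
    for completion, g in zip(completions, gold):
        nums = re.findall(r"-?\d+\.?\d*", completion.replace(",", ""))
        rewards.append(1.0 if nums and nums[-1] == g else 0.0)
    return rewards

args = GRPOConfig(
    output_dir="Llama-3.2-1B-GRPO-gsm8k",
    num_generations=8,           # completions sampled per prompt (the "group")
    max_completion_length=256,
    per_device_train_batch_size=8,
    logging_steps=10,
    report_to="wandb",
)
trainer = GRPOTrainer(
    model="meta-llama/Llama-3.2-1B-Instruct",
    reward_funcs=correctness_reward,
    args=args,
    train_dataset=dataset,
)
trainer.train()
```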