
Llama3.2-1B Post-Training with GRPO from DeepSeek

Model link: https://huggingface.co/accuracy-maker/Llama-3.2-1B-GRPO-gsm8k
W&B link: https://wandb.ai/accuracy-maker/Llama3.2-1B-GRPO?nw=nwuseraccuracymaker
Hello everyone! I was recently curious about how GRPO works for post-training a base model, so I ran a simple experiment to find out. The experiment is nearly free and reproducible by anyone.
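
For context, GRPO (Group Relative Policy Optimization, introduced in DeepSeek's DeepSeekMath paper) drops PPO's learned value model and instead computes a baseline from a group of samples: for each question, the policy generates $G$ completions, each one is scored by a reward function, and a completion's advantage is its reward normalized within its own group:

$$
\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1, \dots, r_G)}{\operatorname{std}(r_1, \dots, r_G)}
$$

Completions that beat their group's average get reinforced; the rest get pushed down.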

Instance Setup

I used AWS to run the entire experiment. My instance configuration was:
  • Ubuntu 22.04 AMI with PyTorch preinstalled
  • g5.xlarge (4 vCPUs, one NVIDIA A10G GPU with 24 GB VRAM)
  • 45 GB storage
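
Before training, a quick sanity check (a minimal Python sketch) confirms the instance's GPU is visible to PyTorch; on a g5.xlarge you should see a single NVIDIA A10G:

```python
import torch

# Verify the PyTorch build and that the g5.xlarge's GPU is usable.
print(torch.__version__)
print(torch.cuda.is_available())      # expect: True
print(torch.cuda.get_device_name(0))  # expect: "NVIDIA A10G"
```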


Base Model

I used **Llama3.2-1B-Instruct** as my base model since it is small, open-source, and runs comfortably on my instance's limited GPU and CPU.
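
Loading it with Hugging Face transformers looks roughly like the sketch below (note the repo is gated on the Hub, so you need to accept Meta's license first):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.2-1B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # ~2.5 GB of weights, well within the A10G's 24 GB
    device_map="auto",
)
```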


Dataset

The training dataset is openai/gsm8k, a small, clean dataset of grade-school math word problems for mathematical reasoning.
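
Loading it with the datasets library is straightforward. Each example pairs a word problem with step-by-step reasoning that ends in a `#### <final answer>` marker, which makes the ground truth easy to extract for a reward function:

```python
from datasets import load_dataset

dataset = load_dataset("openai/gsm8k", "main")  # ~7.5k train / ~1.3k test examples

example = dataset["train"][0]
print(example["question"])

# Every answer ends with "#### <number>"; split it off to get the reference answer.
final_answer = example["answer"].split("####")[-1].strip()
print(final_answer)
```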


Training Plot

[Interactive W&B panels (a run set of 2 runs) are embedded here; see the W&B project linked above for the training curves.]
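
For reference, here is a minimal sketch of a training script that could produce runs like these. It assumes TRL's GRPOTrainer and uses a toy correctness reward that compares the last number in each completion against the gsm8k reference answer; the exact reward shaping and hyperparameters in my runs may differ.

```python
import re
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# GRPOTrainer expects a "prompt" column; keep the reference answer alongside it.
def preprocess(example):
    return {
        "prompt": example["question"],
        "target": example["answer"].split("####")[-1].strip().replace(",", ""),
    }

train_dataset = load_dataset("openai/gsm8k", "main", split="train").map(preprocess)

# Extra dataset columns (here "target") are forwarded to the reward function as
# kwargs. Reward 1.0 if the last number in the completion matches the reference.
def correctness_reward(completions, target, **kwargs):
    rewards = []
    for completion, ref in zip(completions, target):
        numbers = re.findall(r"-?\d+\.?\d*", completion.replace(",", ""))
        rewards.append(1.0 if numbers and numbers[-1] == ref else 0.0)
    return rewards

training_args = GRPOConfig(
    output_dir="Llama-3.2-1B-GRPO-gsm8k",
    num_generations=8,              # group size G for the relative advantage
    max_completion_length=256,
    per_device_train_batch_size=8,  # effective batch must divide by num_generations
    learning_rate=1e-6,
    bf16=True,
    logging_steps=10,
)

trainer = GRPOTrainer(
    model="meta-llama/Llama-3.2-1B-Instruct",
    reward_funcs=correctness_reward,
    args=training_args,
    train_dataset=train_dataset,
)
trainer.train()
```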