Report (draft)
Total training hours: 2000 hours
Supervised Fine-Tuning (SFT)
Total training hours: 1602 hours
Total A100 hours: 9213 hours
Pretrained models used for experiments: Pythia-12B, LLaMA-30B, LLaMA-7B, LLaMA-13B
We will now introduce some recommended settings for training an SFT model.
Data mixing reduces overfitting
Adding other conversation-based datasets such as Alpaca, Dolly, or Vicuna does improve the models further in terms of loss; in particular, models trained with the Alpaca mix gain higher accuracy and lower loss.
(need more elaboration)
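As a rough sketch of how such a data mix can be assembled with the Hugging Face `datasets` library (the dataset IDs and the 70/30 ratio below are illustrative placeholders, not the exact mix used in these runs):

```python
from datasets import load_dataset, interleave_datasets

# Placeholder dataset IDs and mixing ratio -- not the exact mix used in these runs.
oasst = load_dataset("OpenAssistant/oasst1", split="train")
alpaca = load_dataset("tatsu-lab/alpaca", split="train")

# Reduce both sources to a single "text" column so they can be interleaved.
oasst = oasst.map(lambda ex: {"text": ex["text"]}, remove_columns=oasst.column_names)
alpaca = alpaca.map(
    lambda ex: {"text": ex["instruction"] + "\n" + ex["input"] + "\n" + ex["output"]},
    remove_columns=alpaca.column_names,
)

# Roughly 70% Open Assistant conversations, 30% Alpaca instructions (illustrative ratio).
mixed = interleave_datasets([oasst, alpaca], probabilities=[0.7, 0.3], seed=42)
print(mixed[0]["text"])
```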
Run set: 14 runs
Accuracy and loss are not the best metrics
Based on human ratings from the training team, we find that eval/accuracy alone does not tell us whether the model performs well in terms of sampling quality. Hence we tried using scores provided by reward models: specifically, we trained two reward models on (fill in later) and used their two scores as a proxy for human ratings of the sampled results.
(need more elaboration)
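As a minimal sketch of scoring SFT samples with a reward model (the checkpoint name and example prompt are placeholders; substitute the two reward models described above):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Placeholder checkpoint -- substitute one of the two trained reward models.
rm_name = "OpenAssistant/reward-model-deberta-v3-large-v2"
tokenizer = AutoTokenizer.from_pretrained(rm_name)
reward_model = AutoModelForSequenceClassification.from_pretrained(rm_name).eval()

def rm_score(prompt: str, response: str) -> float:
    """Return the scalar reward assigned to a (prompt, response) pair."""
    inputs = tokenizer(prompt, response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return reward_model(**inputs).logits[0, 0].item()

# Rank two sampled answers to the same prompt by reward instead of loss/accuracy.
prompt = "Explain the difference between a list and a tuple in Python."
print(rm_score(prompt, "A list is mutable, while a tuple cannot be changed after creation."))
print(rm_score(prompt, "They are exactly the same thing."))
```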
Correlations between evaluation loss, accuracy, and reward model scores, calculated from the Google Sheets training log
| Spearman | Loss | Accuracy | RM score 6.9B | RM score 1.4B |
|---|---|---|---|---|
| Loss | - | -0.9524 | 0.6190 | 0.2619 |
| Accuracy | - | - | -0.5000 | -0.1904 |
| RM score 6.9B | - | - | - | 0.8809 |
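For reference, the table above can be reproduced from the per-run metrics with `scipy.stats.spearmanr`; the file name and column names below are illustrative stand-ins for the training-log columns:

```python
import pandas as pd
from scipy.stats import spearmanr

# Illustrative log export: one row per SFT run with its eval loss, eval accuracy,
# and the scores from the 6.9B and 1.4B reward models. The path is a placeholder.
log = pd.read_csv("sft_training_log.csv")
metrics = ["loss", "accuracy", "rm_score_6.9b", "rm_score_1.4b"]

for i, a in enumerate(metrics):
    for b in metrics[i + 1:]:
        rho, _ = spearmanr(log[a], log[b])
        print(f"Spearman({a}, {b}) = {rho:.4f}")
```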
Notable observation
- A two-stage training scheme, where we first train the model to answer a mix of various instruction datasets (summarization, explanation, math QA, etc.) and then fine-tune it on the Open Assistant conversation dataset, gains the best reward scores.
The choice of pretrained model matters the most
Reward Model
Total training hours: 321 hours
Total A100 hours: 9213 hours
Pretrained models used for experiments: Pythia-1.4B, Pythia-6.9B, Pythia-1.1B, LLaMA-7B (half of the layers frozen), BLOOMZ-1.1B
Again, the pretrained model plays the biggest role
Compared to tuning hyperparameters such as learning rate, dropout, and weight decay, the choice of pretrained model has the largest impact on the final accuracy in discerning good from bad responses.
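For context, the reward models are trained to prefer the better of two responses with a pairwise ranking loss; a minimal sketch of that objective (the backbone name and example pair are illustrative, and details such as layer freezing are omitted):

```python
import torch.nn.functional as F
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Illustrative backbone -- the actual runs compared Pythia, LLaMA, and BLOOMZ variants.
name = "EleutherAI/pythia-1.4b"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=1)

def pairwise_loss(prompt, chosen, rejected):
    """Standard reward-model objective: -log(sigmoid(r_chosen - r_rejected))."""
    good = tokenizer(prompt + chosen, return_tensors="pt", truncation=True)
    bad = tokenizer(prompt + rejected, return_tensors="pt", truncation=True)
    r_good = model(**good).logits.squeeze(-1)
    r_bad = model(**bad).logits.squeeze(-1)
    return -F.logsigmoid(r_good - r_bad).mean()

loss = pairwise_loss("What is 2 + 2?", " 2 + 2 equals 4.", " It is probably 5.")
loss.backward()
# Accuracy here is the fraction of pairs where the chosen response scores higher.
```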
Run set: 10 runs
We chose the Pythia series for its balanced overall performance and speed (Flash Attention support) as the basis for our RLHF training.
RLHF
Total training hours: 82 hours