
Report (draft)

Created on April 16 | Last edited on April 17
Total training hours : 2000 hours

Supervised Fine Tuning

total training hours : 1602 hours
total A100 hours : 9213 hours
pretrained models used for experiments : pythia-12B, llama-30b, llama-7b, llama-13b.

We will now introduce some recommended settings for training an SFT model.

Data mixing reduces overfitting

Adding other conversation-based datasets such as Alpaca, Dolly, or Vicuna improves the models further in loss; in particular, models trained with Alpaca mixed in reach higher accuracy and lower loss.
(need more elaboration)
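
Below is a minimal sketch of how such a data mix can be assembled. The dataset names, the placeholder conversation strings, and the sampling weights are all illustrative assumptions, not the exact mix used in these runs.

```python
import random

# Illustrative data mix: the placeholder conversations and the sampling weights
# below are assumptions, not the proportions used in the actual runs.
datasets = {
    "oasst":  ["<|prompter|>...<|assistant|>..."],          # Open Assistant conversations
    "alpaca": ["### Instruction: ...\n### Response: ..."],
    "dolly":  ["instruction/context/response ..."],
    "vicuna": ["multi-turn conversation ..."],
}
weights = {"oasst": 0.5, "alpaca": 0.2, "dolly": 0.15, "vicuna": 0.15}

def sample_mixed_batch(batch_size: int, rng: random.Random) -> list[str]:
    """Draw a batch where each example's source dataset is chosen by weight."""
    names = list(datasets)
    probs = [weights[n] for n in names]
    batch = []
    for _ in range(batch_size):
        name = rng.choices(names, weights=probs, k=1)[0]
        batch.append(rng.choice(datasets[name]))
    return batch

batch = sample_mixed_batch(8, random.Random(0))
```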



[W&B chart panel: run set of 14 runs]


Accuracy and loss are not the best metrics

Based on human ratings from the training team, we find that eval/accuracy alone does not tell us whether the model performs well in terms of sampling quality. Hence we tried using the scores provided by reward models: specifically, we trained two reward models on (fill in later) and use the two scores as a proxy for human ratings of the sampled results.

(need more elaboration)
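
As a concrete illustration, the snippet below scores sampled replies with a reward model loaded as a Hugging Face sequence-classification model. The checkpoint name is a placeholder and the prompt/reply formatting is an assumption; the exact input format depends on how the reward models were trained.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

RM_NAME = "path/to/reward-model"  # placeholder for the 1.4B / 6.9B reward model checkpoints

tokenizer = AutoTokenizer.from_pretrained(RM_NAME)
reward_model = AutoModelForSequenceClassification.from_pretrained(RM_NAME, num_labels=1)
reward_model.eval()

@torch.no_grad()
def rm_score(prompt: str, reply: str) -> float:
    """Score a sampled reply with the reward model (higher = preferred)."""
    inputs = tokenizer(prompt, reply, return_tensors="pt", truncation=True)
    return reward_model(**inputs).logits.squeeze().item()

# Rank SFT samples for a prompt by reward-model score instead of loss/accuracy.
samples = ["reply A ...", "reply B ..."]
ranked = sorted(samples, key=lambda r: rm_score("Explain RLHF briefly.", r), reverse=True)
```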

Correlations between evaluation loss, accuracy and reward model scores

| Spearman | loss | accuracy | rm score 6.9B | rm score 1.4B |
| --- | --- | --- | --- | --- |
| loss | | -0.9524 | 0.6190 | 0.2619 |
| accuracy | | | -0.5000 | -0.1904 |
| rm score 6.9B | | | | 0.8809 |
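
For reference, this is how such Spearman correlations can be computed with SciPy. The per-run numbers below are made-up placeholders; the table above was computed from the logged metrics (see the comment at the end of the report).

```python
from scipy.stats import spearmanr

# Made-up per-run metrics standing in for the logged values; the table above was
# computed from the full training log.
eval_loss     = [1.31, 1.28, 1.25, 1.22, 1.20]
eval_accuracy = [0.61, 0.63, 0.64, 0.66, 0.67]
rm_score_69b  = [2.1, 1.8, 2.4, 2.6, 2.9]

rho_loss_acc, _ = spearmanr(eval_loss, eval_accuracy)
rho_loss_rm, _  = spearmanr(eval_loss, rm_score_69b)
print(f"Spearman(loss, accuracy)      = {rho_loss_acc:.4f}")
print(f"Spearman(loss, rm score 6.9B) = {rho_loss_rm:.4f}")
```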


Notable observation
  • A 2-stage training scheme, where we first train the model to answer a mix of various instruction datasets (summarization, explanation, math QA, etc.) and then fine-tune it on the Open Assistant conversation dataset, gains the best reward scores; a sketch of this schedule follows below.
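
A minimal sketch of the 2-stage schedule, assuming a generic `finetune` helper in place of the actual SFT training loop; the epoch counts and dataset placeholders are illustrative, not the real settings.

```python
# `finetune` is a stand-in for the actual SFT training loop, not a real API.
def finetune(model, dataset, epochs: int):
    """Placeholder for one supervised fine-tuning run of `model` on `dataset`."""
    ...  # run the usual causal-LM training here
    return model

instruction_mix = ["summarization", "explanation", "math QA"]  # stage-1 task mix (names only)
oasst = "Open Assistant conversation dataset"                  # stage-2 data

model = "pretrained checkpoint"                     # e.g. one of the llama/pythia models above
model = finetune(model, instruction_mix, epochs=1)  # stage 1: broad instruction mix
model = finetune(model, oasst, epochs=2)            # stage 2: OASST conversations only
```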


Choice of pretrained model matters the most

Reward Model

total training hours : 321 hours
total A100 hours : 9213 hours
pretrained models used for experiments : pythia-1.4b, pythia-6.9b, pythia-1.1b, llama-7b (half of the layers frozen), bloomz-1.1b
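
The llama-7b reward model above is trained with half of its layers frozen. Below is a minimal sketch of what that might look like with a Hugging Face LLaMA-style checkpoint; the path is a placeholder and freezing the lower half specifically is an assumption.

```python
from transformers import AutoModelForSequenceClassification

# Placeholder checkpoint path; a single-logit head turns the LM into a reward model.
model = AutoModelForSequenceClassification.from_pretrained("path/to/llama-7b", num_labels=1)

# Freeze the lower half of the decoder blocks (assumption: "half of the layers
# frozen" means the lower half), leaving the upper half and the head trainable.
blocks = model.model.layers
for block in blocks[: len(blocks) // 2]:
    for p in block.parameters():
        p.requires_grad = False
```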

Again, the pretrained model plays the biggest role

Compared to tuning hyperparameters such as learning rate, dropout, and weight decay, the choice of pretrained model has the largest impact on the final accuracy in discerning good from bad responses.
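
To make that accuracy concrete, the sketch below uses a pairwise ranking objective of the kind commonly used for reward models (the objective itself is an assumption, not a confirmed detail of our runs): the preferred reply should score higher than the rejected one, and accuracy is the fraction of pairs where it does.

```python
import torch
import torch.nn.functional as F

def pairwise_rm_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Ranking loss: push the score of the preferred reply above the rejected one."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

def pairwise_accuracy(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> float:
    """Fraction of comparison pairs where the preferred reply gets the higher score."""
    return (r_chosen > r_rejected).float().mean().item()

# Toy scores for four comparison pairs.
r_chosen = torch.tensor([1.2, 0.3, 2.0, -0.1])
r_rejected = torch.tensor([0.4, 0.9, 1.1, -0.5])
print(pairwise_rm_loss(r_chosen, r_rejected))   # scalar loss tensor
print(pairwise_accuracy(r_chosen, r_rejected))  # 0.75
```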


[W&B chart panel: run set of 10 runs]


We chose the Pythia series for our RLHF training because of its overall balance of performance and speed (flash attention support).


RLHF

total training hours : 82 hours
Comment (theblackcat102): Correlations between evaluation loss, accuracy and reward model scores were calculated from the Google Sheets training log.