Report (draft)
Total training hours: 2000 hours
Supervised Fine-Tuning (SFT)
Total training hours: 1602 hours
Total A100 hours: 9213 hours
Pretrained models used for experiments: Pythia-12B, LLaMA-30B, LLaMA-7B, LLaMA-13B
We will now introduce some recommended settings for training an SFT model.
Data mixing reduces overfitting
Adding other conversation-based datasets such as Alpaca, Dolly, or Vicuna does improve the models further in terms of loss; in particular, models trained with the Alpaca mix gain higher accuracy and lower loss.
(need more elaboration)
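As a rough sketch of how such a data mix can be assembled with the Hugging Face `datasets` library (the dataset IDs and the 70/30 ratio below are illustrative placeholders, not the exact mix used in these runs):

```python
from datasets import load_dataset, interleave_datasets

# Placeholder dataset IDs and mixing ratio -- not the exact mix used in these runs.
oasst = load_dataset("OpenAssistant/oasst1", split="train")
alpaca = load_dataset("tatsu-lab/alpaca", split="train")

# Reduce both sources to a single "text" column so they can be interleaved.
oasst = oasst.map(lambda ex: {"text": ex["text"]}, remove_columns=oasst.column_names)
alpaca = alpaca.map(
    lambda ex: {"text": ex["instruction"] + "\n" + ex["input"] + "\n" + ex["output"]},
    remove_columns=alpaca.column_names,
)

# Roughly 70% Open Assistant conversations, 30% Alpaca instructions (illustrative ratio).
mixed = interleave_datasets([oasst, alpaca], probabilities=[0.7, 0.3], seed=42)
print(mixed[0]["text"])
```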
Run set: 14 runs
Accuracy and loss are not the best metrics
Based on human ratings from the training team, we find that eval/accuracy alone does not tell us whether the model performs well in terms of sampling quality. Hence we tried using scores provided by reward models: specifically, we trained two reward models on (fill in later) and used their two scores as a proxy for human ratings of the sampled results.
(need more elaboration)
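As a minimal sketch of scoring SFT samples with a reward model (the checkpoint name and example prompt are placeholders; substitute the two reward models described above):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Placeholder checkpoint -- substitute one of the two trained reward models.
rm_name = "OpenAssistant/reward-model-deberta-v3-large-v2"
tokenizer = AutoTokenizer.from_pretrained(rm_name)
reward_model = AutoModelForSequenceClassification.from_pretrained(rm_name).eval()

def rm_score(prompt: str, response: str) -> float:
    """Return the scalar reward assigned to a (prompt, response) pair."""
    inputs = tokenizer(prompt, response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return reward_model(**inputs).logits[0, 0].item()

# Rank two sampled answers to the same prompt by reward instead of loss/accuracy.
prompt = "Explain the difference between a list and a tuple in Python."
print(rm_score(prompt, "A list is mutable, while a tuple cannot be changed after creation."))
print(rm_score(prompt, "They are exactly the same thing."))
```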
Correlations between evaluation loss, accuracy, and reward model scores, calculated from the Google Sheets training log
| Spearman | Loss | Accuracy | RM score 6.9B | RM score 1.4B |
|---|---|---|---|---|
| Loss | - | -0.9524 | 0.6190 | 0.2619 |
| Accuracy | - | - | -0.5000 | -0.1904 |
| RM score 6.9B | - | - | - | 0.8809 |
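For reference, the table above can be reproduced from the per-run metrics with `scipy.stats.spearmanr`; the file name and column names below are illustrative stand-ins for the training-log columns:

```python
import pandas as pd
from scipy.stats import spearmanr

# Illustrative log export: one row per SFT run with its eval loss, eval accuracy,
# and the scores from the 6.9B and 1.4B reward models. The path is a placeholder.
log = pd.read_csv("sft_training_log.csv")
metrics = ["loss", "accuracy", "rm_score_6.9b", "rm_score_1.4b"]

for i, a in enumerate(metrics):
    for b in metrics[i + 1:]:
        rho, _ = spearmanr(log[a], log[b])
        print(f"Spearman({a}, {b}) = {rho:.4f}")
```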
Notable observation
- A two-stage training scheme, where we first train the model to answer a mix of various instruction datasets (summarization, explanation, math QA, etc.) and then fine-tune it on the Open Assistant conversation dataset, gains the best reward scores.
The choice of pretrained model matters the most
Reward Model
Total training hours: 321 hours
Total A100 hours: 9213 hours
Pretrained models used for experiments: Pythia-1.4B, Pythia-6.9B, Pythia-1.1B, LLaMA-7B (half of the layers frozen), BLOOMZ-1.1B
Again, the pretrained model plays the biggest role
Compared to tuning hyperparameters such as learning rate, dropout, and weight decay, the choice of pretrained model has the largest impact on the final accuracy in discerning good from bad responses.
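For context, the reward models are trained to prefer the better of two responses with a pairwise ranking loss; a minimal sketch of that objective (the backbone name and example pair are illustrative, and details such as layer freezing are omitted):

```python
import torch.nn.functional as F
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Illustrative backbone -- the actual runs compared Pythia, LLaMA, and BLOOMZ variants.
name = "EleutherAI/pythia-1.4b"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=1)

def pairwise_loss(prompt, chosen, rejected):
    """Standard reward-model objective: -log(sigmoid(r_chosen - r_rejected))."""
    good = tokenizer(prompt + chosen, return_tensors="pt", truncation=True)
    bad = tokenizer(prompt + rejected, return_tensors="pt", truncation=True)
    r_good = model(**good).logits.squeeze(-1)
    r_bad = model(**bad).logits.squeeze(-1)
    return -F.logsigmoid(r_good - r_bad).mean()

loss = pairwise_loss("What is 2 + 2?", " 2 + 2 equals 4.", " It is probably 5.")
loss.backward()
# Accuracy here is the fraction of pairs where the chosen response scores higher.
```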
Run set: 10 runs
We chose the Pythia series for its balanced overall performance and speed (Flash Attention support) as the basis for our RLHF training.
RLHF
Total training hours: 82 hours