DPO BTLM
Created on October 23|Last edited on October 23
[Run set: 36 runs, first 10 shown in each panel.]
Results:
As the graphs above show, the best performance comes from the DPO model trained with the TRL implementation on top of the SFT model from the original paper. The SFT model trained with TRL does not perform well, and its DPO training shows instabilities.
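For reference, both TRL and the original paper optimize the same per-example DPO objective: the negative log-sigmoid of the scaled difference between the policy's and the reference model's log-probability margins on the chosen versus rejected completion. A minimal sketch (the function name, `beta` value, and example log-probabilities are illustrative, not taken from these runs):

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-example DPO loss:
    -log sigmoid(beta * ((logp_c - ref_logp_c) - (logp_r - ref_logp_r)))."""
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)), written out with math.exp
    return -math.log(1.0 / (1.0 + math.exp(-logits)))

# Policy prefers the chosen answer more than the reference does -> small loss
loss_good = dpo_loss(-10.0, -14.0, -11.0, -12.0)
# Policy prefers the rejected answer -> larger loss
loss_bad = dpo_loss(-14.0, -10.0, -12.0, -11.0)
```

A small `beta` (TRL's default is 0.1) keeps the policy close to the reference model; instabilities like those seen above are often sensitive to this choice.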