DPO BTLM

Created on October 23|Last edited on October 23

[Six training-curve panels, each showing the first 10 runs over roughly 700 steps]
Run set: 36 runs


Results:

As the graphs above show, the best performance comes from the DPO model trained with the TRL implementation on top of the SFT model from the original paper. The SFT model trained with TRL does not perform well, and the DPO training curves show instabilities.
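For reference, the objective being optimized in these runs is the standard DPO loss. The sketch below (an illustrative helper, not code from these experiments) computes it for a single preference pair from the summed log-probabilities of the chosen and rejected completions under the policy and the frozen reference model:

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair.

    Each argument is the summed log-probability of the chosen or
    rejected completion under the policy or the reference model.
    """
    # Implicit reward margins relative to the reference model.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log sigmoid(logits), written as log(1 + exp(-logits)) for stability.
    return math.log1p(math.exp(-logits)) if logits > -30 else -logits

# When the policy matches the reference, the loss is log(2).
base = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Shifting probability mass toward the chosen completion lowers the loss.
better = dpo_loss(-8.0, -14.0, -10.0, -12.0)
```

The `beta` of 0.1 here is just the common default; it controls how strongly the policy is penalized for drifting from the reference model.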