ORPO - Full Model
Results from applying ORPO fine-tuning to the full model.
Created on June 7|Last edited on June 7
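All runs share a similar config; only the learning rate differs. We want logps(y, x-) and logps(y, x+), as well as logps(y|x-) and logps(y|x+), to diverge, but not logps(x-) and logps(x+). Since logps(y, x) = logps(x) + logps(y|x), the gap in the joint term should come from the conditional term rather than from the prompt term.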
- Data config: Single constraint per instruction.
- Batch size: 1
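The three quantities tracked in this report are related by the chain rule, and a minimal sketch of how they could be separated from per-token log-probabilities is shown here. The function name `split_logps`, the `prompt_len` argument, and the choice of summed (rather than length-averaged) log-probabilities are assumptions for illustration; the report does not say how the logged metrics were computed.

```python
def split_logps(token_logps, prompt_len):
    """Split per-token log-probs of a (prompt + response) sequence.

    token_logps: log-prob of each token given the tokens before it.
    prompt_len:  number of leading tokens that belong to the prompt x
                 (hypothetical argument, not taken from the report).
    Returns (logps(x), logps(y|x), logps(y, x)) as sums over tokens.
    """
    if not 0 <= prompt_len <= len(token_logps):
        raise ValueError("prompt_len out of range")
    logp_x = sum(token_logps[:prompt_len])          # prompt term
    logp_y_given_x = sum(token_logps[prompt_len:])  # conditional term
    logp_joint = logp_x + logp_y_given_x            # chain rule
    return logp_x, logp_y_given_x, logp_joint


# Toy values only: two prompt tokens followed by two response tokens.
logp_x, logp_y_given_x, logp_joint = split_logps([-0.5, -1.0, -0.25, -0.75], 2)
```

Computing this split for both x+ and x- makes it possible to see whether a growing gap in the joint term is driven by the conditional term (desired) or by the prompt term (undesired).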
LR = 5e-6
At around 0.3 epochs, the log odds ratio increased. However, logps(x+) and logps(x-) diverged, which is undesirable. Inference after 1 epoch also gave poor results: https://docs.google.com/spreadsheets/d/1YqzqbOzWS41sYf79rCDR5W6O-rcYdDyUoAsUgJevfes/edit?gid=1861370952#gid=1861370952
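For reference, the log odds ratio mentioned above can be sketched as follows, using the ORPO definition odds(p) = p / (1 - p). The names `logp_pos` and `logp_neg` are hypothetical, and the inputs are assumed to be sequence log-probabilities (typically length-averaged) that are strictly negative; which sequences were fed into this term in these runs is not stated in the report.

```python
import math


def log_odds(logp):
    """log(p / (1 - p)) for p = exp(logp); requires logp < 0."""
    return logp - math.log1p(-math.exp(logp))


def log_odds_ratio(logp_pos, logp_neg):
    """Log of odds(favored) / odds(disfavored)."""
    return log_odds(logp_pos) - log_odds(logp_neg)


def odds_ratio_loss(logp_pos, logp_neg):
    """-log(sigmoid(log odds ratio)), written as a stable softplus."""
    z = log_odds_ratio(logp_pos, logp_neg)
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# Toy values only: favored sequence at p = 0.8, disfavored at p = 0.2.
ratio = log_odds_ratio(math.log(0.8), math.log(0.2))
loss = odds_ratio_loss(math.log(0.8), math.log(0.2))
```

The ratio depends only on the two log-probabilities passed in, so it can rise even when the change comes from the prompt term; logging the prompt and conditional terms separately helps tell the two cases apart.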
LR = 5e-4, 5e-5, 5e-7, 1e-6