Naive Preference Feedback DPO ELO Performance: LoRA vs. QLoRA
QLoRA Chaiverse submissions:
- baseline: jellywibble-hathor-stabl_8901_v2
- trained on 30K data: jellywibble-qlora-30k-pr_4525_v2
- trained on 60K data: jellywibble-qlora-60k-pr_3859_v2
- trained on 90K data: jellywibble-qlora-90k-pr_7056_v2
- trained on 120K data: jellywibble-qlora-120k-p_350_v2
LoRA Chaiverse submissions:
- baseline: jellywibble-hathor-stabl_8901_v3
- trained on 30K data: jellywibble-lora-30k-pre_9052_v1
- trained on 60K data: jellywibble-lora-60k-pre_728_v1
- trained on 90K data: jellywibble-lora-90k-pre_8367_v1
- trained on 120K data: jellywibble-lora-120k-pr_1572_v1
- trained on 120K data (and 2 epochs, effectively 240K data): jellywibble-lora-120k-pr_2827_v1
Overall Results

Learnings & Conclusions
- Does 4-bit quantisation hurt model performance? Yes, do not use 4-bit quantisation. Nguy also used a full 32-bit LoRA fine-tune for his alignment DPO training (a configuration sketch contrasting the two setups follows this list).
- How does dataset scaling affect model performance as evaluated on Chaiverse? Under a full 32-bit LoRA fine-tune, performance (ELO) does indeed scale with dataset size.
- Does the Hugging Face DPO trainer work out of the box, or is there something special to it? Yes, it works out of the box, but we still need to verify how hyperparameters affect training, although it is clear that data quality matters most.
- Downstream: does the approach where developers do not use A-vs-B responses directly, but instead use a reward model to generate A-vs-B preferences, drastically reduce the amount of data required?
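As a reference for the quantisation comparison above, here is a minimal sketch of the two adapter setups using Transformers, PEFT, and bitsandbytes. The base-model path, LoRA rank, and target modules are placeholders rather than the exact values used in these runs; the point is only the contrast between full-precision LoRA and 4-bit NF4 QLoRA.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

BASE_MODEL = "path/to/hathor-stable-baseline"  # placeholder for the baseline checkpoint

# Shared adapter settings (rank/targets are illustrative, not the runs' actual values)
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

# LoRA variant: full-precision (fp32) base weights, adapters on top
lora_model = AutoModelForCausalLM.from_pretrained(BASE_MODEL, torch_dtype=torch.float32)
lora_model = get_peft_model(lora_model, lora_cfg)

# QLoRA variant: base weights quantised to 4-bit NF4, adapters trained in higher precision
bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
qlora_model = AutoModelForCausalLM.from_pretrained(BASE_MODEL, quantization_config=bnb_cfg)
qlora_model = prepare_model_for_kbit_training(qlora_model)
qlora_model = get_peft_model(qlora_model, lora_cfg)
```

Everything else about the training loop is held constant between the two variants; only the precision of the frozen base weights differs.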
Methods
- Download the preference dataset for all submission IDs related to the baseline model submission (~120K rows)
- Train with the out-of-the-box DPOConfig on an H100 on RunPod (total training time: 24h); detailed configurations can be seen in the run info below, and a minimal sketch of the setup follows this list
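Below is a minimal sketch of the DPO training call with TRL, assuming a preference dataset already in the (prompt, chosen, rejected) format. Batch sizes, beta, and the LoRA rank are illustrative defaults rather than the exact values of these runs (those are in the run info), and keyword names such as processing_class vs. tokenizer differ between TRL versions.

```python
from datasets import Dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

BASE_MODEL = "path/to/hathor-stable-baseline"  # placeholder, as in the sketch above

# Hypothetical preference rows in the format DPOTrainer expects
train_dataset = Dataset.from_dict({
    "prompt": ["User: hey, how are you?\n"],
    "chosen": ["I'm doing great, thanks for asking!"],
    "rejected": ["fine."],
})

model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)

# Mostly-default DPOConfig; values shown here are illustrative, not the runs' settings
dpo_args = DPOConfig(
    output_dir="dpo-lora-sketch",
    beta=0.1,                        # DPO temperature (TRL default)
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    num_train_epochs=1,
    bf16=True,
    logging_steps=10,
)

trainer = DPOTrainer(
    model=model,
    ref_model=None,                  # with a peft_config, TRL uses the frozen base as reference
    args=dpo_args,
    train_dataset=train_dataset,
    processing_class=tokenizer,      # older TRL versions take `tokenizer=` instead
    peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),
)
trainer.train()
```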
Reward Performance
(Chart panels: run data/actual_run_fixed)
Logits Stability
(Chart panel: run data/actual_run_fixed)