TP, seq-TP, no-TP
Created on July 31 | Last edited on August 8
Comparison of loss and grad norm across three variants: TP, sequence TP (seq-TP), and no TP.
Commit: 8709792c376505d41f71257b803d56713216bf2a
Configs:
Plots:
Run set (3 runs)
Comparison:
- 08.08.25
- masked distillation with SFT data
- red: sequence TP (parallel embeddings = false)
- green: no TP
- commit: 3073959b50fb853f368a070968aaaab1951d4387 branch distill_sft
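When comparing grad norm between a TP run and a no-TP run, it helps to recall why the two numbers should agree at all. The following is a minimal sketch, not the training code from the commits above: it assumes each rank holds a disjoint slice of the gradient, so the squared per-shard norms add up to the squared global norm. The function names and gradient values are hypothetical and purely illustrative.

```python
import math


def l2_norm(values):
    """L2 norm of a flat list of gradient values."""
    return math.sqrt(sum(v * v for v in values))


def global_norm_from_shards(shards):
    """Combine per-shard gradients into one global L2 norm.

    Assumption: shards are disjoint slices of the same gradient,
    so their squared norms can simply be summed.
    """
    return math.sqrt(sum(l2_norm(shard) ** 2 for shard in shards))


# Hypothetical gradient values, for illustration only.
full_grad = [0.5, -1.0, 2.0, 0.25, -0.75, 1.5]

# Pretend 2-way tensor-parallel split of the same gradient.
tp_shards = [full_grad[:3], full_grad[3:]]

unsharded = l2_norm(full_grad)
sharded = global_norm_from_shards(tp_shards)
print(math.isclose(unsharded, sharded))
```

Parameters that are replicated across ranks rather than sharded must be counted only once in such a sum; otherwise the sharded norm comes out larger than the unsharded one, which would show up as a constant offset between the curves.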
Run set (2 runs)