TP, seq-TP, no-TP

Created on July 31 | Last edited on August 8
Comparison of loss and grad norm of different variants.
Commit: 8709792c376505d41f71257b803d56713216bf2a

Configs:

Plots:


[Plot: x-axis Step (ticks 50 to 300), y-axis ticks 2 to 12]
[Plot: x-axis Step (ticks 50 to 300), y-axis ticks 5 to 30]
Run set: 3 runs


Comparison:

  • 08.08.25
  • masked distillation with SFT data
  • red: sequence TP (parallel embeddings = false)
  • green: no TP
  • commit: 3073959b50fb853f368a070968aaaab1951d4387 branch distill_sft
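
For reference, a minimal sketch of what a masked distillation loss could look like. This is an illustration only, not the code from the `distill_sft` branch. It assumes "masked" means the loss is averaged only over token positions selected by a loss mask (for example, response tokens in SFT data), and that the objective is KL(teacher || student). The names `masked_distill_loss`, `log_softmax`, and `loss_mask` are hypothetical.

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax over one token's vocabulary logits.
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def masked_distill_loss(student_logits, teacher_logits, loss_mask):
    # Assumption: mean KL(teacher || student) over positions where loss_mask is 1.
    # Each *_logits argument is a list of per-position logit lists.
    total, count = 0.0, 0
    for s, t, m in zip(student_logits, teacher_logits, loss_mask):
        if not m:
            continue  # masked-out position (e.g. prompt token) contributes nothing
        log_p_s = log_softmax(s)
        log_p_t = log_softmax(t)
        total += sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_p_t, log_p_s))
        count += 1
    # With no unmasked positions, return 0.0 rather than dividing by zero.
    return total / count if count else 0.0
```

With this formulation, positions outside the mask cannot affect either the loss or its gradient, so differences between the TP variants would show up only on the unmasked tokens.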


Run set: 2 runs