Loss Spikes with NoPE Positional Embedding
The loss curves show four runs of the Pythia 1B model trained from scratch on 100B tokens of FineWeb-Edu (batch size 64, GAS 2). Two runs use no positional embeddings (marked pos_emb: None in the legend) and show an unrecoverable loss spike, while the other runs, even with a rope_pct of just 0.01%, are far more stable. Is this a well-known phenomenon?
I wondered whether this was caused by the random initialization or a bad batch of data, so I changed the seed and reran, but the spike occurred again in roughly the same region.
The original NoPE paper shows nothing like this, which I suspect is because their models were very small and trained on synthetic tasks rather than real-world data: https://proceedings.neurips.cc/paper_files/paper/2023/hash/4e85362c02172c0c6567ce593122d31c-Abstract-Conference.html
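For concreteness, here is a minimal sketch of what rope_pct controls in this comparison, assuming a GPT-NeoX-style partial rotary embedding in which only the first fraction of each head's dimensions is rotated. The function name, the half-and-half rotation layout, and the example shapes are illustrative assumptions, not the actual training code: with rope_pct = 0 the queries and keys pass through unrotated, so attention carries no positional signal beyond the causal mask, while any positive fraction rotates at least one dimension pair per position.

```python
# Minimal sketch (not the training code used in these runs): rotary embeddings
# applied to only the first `rope_pct` fraction of each head's dimensions.
import torch

def apply_partial_rope(q, k, rope_pct: float, base: float = 10000.0):
    """q, k: (batch, heads, seq, head_dim). Returns rotated (q, k)."""
    head_dim = q.shape[-1]
    rot_dims = int(head_dim * rope_pct)   # dims that receive RoPE
    rot_dims -= rot_dims % 2              # must be even for pairing
    if rot_dims == 0:                     # NoPE case: no positional signal
        return q, k

    seq_len = q.shape[-2]
    inv_freq = 1.0 / (base ** (torch.arange(0, rot_dims, 2).float() / rot_dims))
    pos = torch.arange(seq_len).float()
    freqs = torch.einsum("s,d->sd", pos, inv_freq)       # (seq, rot_dims/2)
    cos = torch.cat([freqs.cos(), freqs.cos()], dim=-1)  # (seq, rot_dims)
    sin = torch.cat([freqs.sin(), freqs.sin()], dim=-1)

    def rotate(x):
        x_rot, x_pass = x[..., :rot_dims], x[..., rot_dims:]
        half = rot_dims // 2
        x_shifted = torch.cat([-x_rot[..., half:], x_rot[..., :half]], dim=-1)
        return torch.cat([x_rot * cos + x_shifted * sin, x_pass], dim=-1)

    return rotate(q), rotate(k)

# With rope_pct = 0.0 nothing changes (NoPE); even a tiny positive fraction
# rotates a couple of dimensions per position and injects positional signal.
q = k = torch.randn(1, 8, 16, 256)
print(torch.equal(apply_partial_rope(q, k, 0.0)[0], q))   # True  -> NoPE
print(torch.equal(apply_partial_rope(q, k, 0.01)[0], q))  # False -> partial RoPE
```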
Section 1
[Loss-curve panels for the run set (16 runs)]