
Loss Spikes with NoPE (No Positional Embedding)

The loss curves below show 4 runs of the Pythia 1B model trained from scratch on 100B tokens of FineWeb-Edu (batch size 64, gradient accumulation steps 2). In two of the runs there are no positional embeddings (marked pos_emb: none in the legend), and both show an unrecoverable loss spike, while the other runs, even the one with rotary_pct = 0.01, are far more stable. Is this a well-known phenomenon?
I wondered whether this had something to do with the random init or a bad batch of data, so I changed the seed and reran, but the spike occurred again in roughly the same region.

The original NoPE paper shows nothing like this, which I suspect is because their models were very small and trained on synthetic tasks rather than real-world data: https://proceedings.neurips.cc/paper_files/paper/2023/hash/4e85362c02172c0c6567ce593122d31c-Abstract-Conference.html
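For reference, rotary_pct here controls what fraction of each attention head's dimensions receive rotary position information, with the remaining dimensions left position-free. Below is a minimal PyTorch sketch of that partial-RoPE idea; it is an illustration of the mechanism under my own assumptions, not the exact GPT-NeoX code used for these runs, and apply_partial_rope and its arguments are hypothetical names.

```python
import torch

def apply_partial_rope(q, k, rotary_pct=0.01, base=10000.0):
    """Rotate only the first `rotary_pct` fraction of each head dimension.

    q, k: (batch, heads, seq_len, head_dim). With rotary_pct = 1.0 this is
    ordinary RoPE; with a tiny rotary_pct only a couple of dimensions per
    head carry positional information. pos_emb: none (NoPE) would skip this
    function entirely.
    """
    head_dim = q.shape[-1]
    rot_dim = max(2, int(head_dim * rotary_pct)) // 2 * 2  # even, at least 2
    q_rot, q_pass = q[..., :rot_dim], q[..., rot_dim:]
    k_rot, k_pass = k[..., :rot_dim], k[..., rot_dim:]

    # Interleaved-pair rotary angles for positions 0..seq_len-1.
    seq_len = q.shape[-2]
    inv_freq = 1.0 / (base ** (torch.arange(0, rot_dim, 2, device=q.device,
                                            dtype=torch.float32) / rot_dim))
    pos = torch.arange(seq_len, device=q.device, dtype=torch.float32)
    freqs = torch.outer(pos, inv_freq)              # (seq_len, rot_dim // 2)
    cos = freqs.cos().repeat_interleave(2, dim=-1)  # (seq_len, rot_dim)
    sin = freqs.sin().repeat_interleave(2, dim=-1)

    def rotate_half(x):
        # (x0, x1, x2, x3, ...) -> (-x1, x0, -x3, x2, ...)
        x_even, x_odd = x[..., 0::2], x[..., 1::2]
        return torch.stack((-x_odd, x_even), dim=-1).flatten(-2)

    q_rot = q_rot * cos + rotate_half(q_rot) * sin
    k_rot = k_rot * cos + rotate_half(k_rot) * sin
    return torch.cat([q_rot, q_pass], dim=-1), torch.cat([k_rot, k_pass], dim=-1)
```

With rotary_pct = 0.01 the rotated slice is only a couple of dimensions per head, which matches the observation above that even a tiny rotary fraction is enough to keep training stable.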

Section 1


[Charts: two panels plotted against training step (0–12k) for the four runs. Legend:
- seq_length: 2048, dataset: FW_Edu, model: pythia, pos_emb: none, rotary_pct: 1, seed: 175
- seq_length: 2048, dataset: FW_Edu, model: pythia, pos_emb: rotary, rotary_pct: 1, seed: 1234
- seq_length: 2048, dataset: FW_Edu, model: pythia, pos_emb: rotary, rotary_pct: 0.01, seed: 1234
- seq_length: 2048, dataset: FW_Edu, model: pythia, pos_emb: none, rotary_pct: 1, seed: 1234]