
Loss Spikes with NoPE (No Positional Embedding)

The loss curves below show 4 runs of the Pythia 1B model trained from scratch on 100B tokens of FineWeb-Edu (batch size 64, gradient accumulation steps 2). In two of the runs there are no positional embeddings (marked pos_emb: none in the legend), and both show an unrecoverable loss spike, while the other runs, even the one with rotary_pct = 0.01, are far more stable. Is this a well-known phenomenon?
I wondered whether this had something to do with the random init or a bad batch of data, so I changed the seed and reran, but the spike occurred again in roughly the same region.

The original NoPE paper shows nothing like this, which I suspect is because their models were very small and trained on synthetic tasks rather than real-world data: https://proceedings.neurips.cc/paper_files/paper/2023/hash/4e85362c02172c0c6567ce593122d31c-Abstract-Conference.html
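For reference, rotary_pct here controls what fraction of each attention head's dimensions receive rotary position information, with the remaining dimensions left position-free. Below is a minimal PyTorch sketch of that partial-RoPE idea; it is an illustration of the mechanism under my own assumptions, not the exact GPT-NeoX code used for these runs, and apply_partial_rope and its arguments are hypothetical names.

```python
import torch

def apply_partial_rope(q, k, rotary_pct=0.01, base=10000.0):
    """Rotate only the first `rotary_pct` fraction of each head dimension.

    q, k: (batch, heads, seq_len, head_dim). With rotary_pct = 1.0 this is
    ordinary RoPE; with a tiny rotary_pct only a couple of dimensions per
    head carry positional information. pos_emb: none (NoPE) would skip this
    function entirely.
    """
    head_dim = q.shape[-1]
    rot_dim = max(2, int(head_dim * rotary_pct)) // 2 * 2  # even, at least 2
    q_rot, q_pass = q[..., :rot_dim], q[..., rot_dim:]
    k_rot, k_pass = k[..., :rot_dim], k[..., rot_dim:]

    # Interleaved-pair rotary angles for positions 0..seq_len-1.
    seq_len = q.shape[-2]
    inv_freq = 1.0 / (base ** (torch.arange(0, rot_dim, 2, device=q.device,
                                            dtype=torch.float32) / rot_dim))
    pos = torch.arange(seq_len, device=q.device, dtype=torch.float32)
    freqs = torch.outer(pos, inv_freq)              # (seq_len, rot_dim // 2)
    cos = freqs.cos().repeat_interleave(2, dim=-1)  # (seq_len, rot_dim)
    sin = freqs.sin().repeat_interleave(2, dim=-1)

    def rotate_half(x):
        # (x0, x1, x2, x3, ...) -> (-x1, x0, -x3, x2, ...)
        x_even, x_odd = x[..., 0::2], x[..., 1::2]
        return torch.stack((-x_odd, x_even), dim=-1).flatten(-2)

    q_rot = q_rot * cos + rotate_half(q_rot) * sin
    k_rot = k_rot * cos + rotate_half(k_rot) * sin
    return torch.cat([q_rot, q_pass], dim=-1), torch.cat([k_rot, k_pass], dim=-1)
```

With rotary_pct = 0.01 the rotated slice is only a couple of dimensions per head, which matches the observation above that even a tiny rotary fraction is enough to keep training stable.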

Section 1


[Charts: two panels plotted against training step (0–12k) for the four runs. Legend:
- seq_length: 2048, dataset: FW_Edu, model: pythia, pos_emb: none, rotary_pct: 1, seed: 175
- seq_length: 2048, dataset: FW_Edu, model: pythia, pos_emb: rotary, rotary_pct: 1, seed: 1234
- seq_length: 2048, dataset: FW_Edu, model: pythia, pos_emb: rotary, rotary_pct: 0.01, seed: 1234
- seq_length: 2048, dataset: FW_Edu, model: pythia, pos_emb: none, rotary_pct: 1, seed: 1234]