Autoregressive Distillation
All models are EleutherAI GPT-NeoX models, loosely based on Megatron-LM and GPT-3. Models are named by their number of non-embedding parameters; a model named "X to Y" is a distillation of a teacher of size X into a student of size Y.
All models are trained on the Pile with rotary embeddings.
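As context for the "X to Y" runs listed below, here is a minimal sketch of a token-level distillation objective for autoregressive language models: a temperature-softened KL term between the teacher's and student's next-token distributions, mixed with the ordinary cross-entropy loss on the Pile tokens. The report does not specify the exact loss, temperature, or mixing weight used in these runs, and the function and parameter names (`distillation_loss`, `temperature`, `alpha`) are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 2.0, alpha: float = 0.5):
    """Illustrative token-level distillation loss (not necessarily the one used here).

    student_logits, teacher_logits: (batch, seq_len, vocab)
    labels: (batch, seq_len) token ids for the standard LM loss.
    """
    vocab = student_logits.size(-1)

    # Soft targets from the (frozen) teacher, softened by the temperature.
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1).view(-1, vocab)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1).view(-1, vocab)

    # KL(teacher || student), averaged per token; the T^2 factor keeps gradient
    # magnitudes comparable across temperatures.
    kd = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature ** 2

    # Hard-label term: standard next-token cross-entropy against the Pile tokens.
    ce = F.cross_entropy(student_logits.view(-1, vocab), labels.view(-1))

    return alpha * kd + (1.0 - alpha) * ce
```

In practice the teacher (e.g. the 350M or 760M model) runs in `torch.no_grad()` to produce `teacher_logits`, and only the student (e.g. the 125M model) receives gradients.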
Section 1

Run set          Runs
125M                1
350M               21
Large              35
350M to 125M        1
125M to 350M        1
350M to 350M        1
760M to 125M        1