Autoregressive Distillation

All models are EleutherAI GPT-NeoX models, loosely based on Megatron-LM and GPT-3. Each model is named after its number of non-embedding parameters. A model named "X to Y" is a distillation of a model of size X into a model of size Y. All models are trained on the Pile with rotary embeddings.
Created on August 2 | Last edited on August 27
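
The report does not say which distillation objective was used. As a rough illustration of what distilling a model of size X into a model of size Y can look like for an autoregressive language model, the sketch below computes a token-level KL divergence between teacher and student next-token distributions. The function names, the `temperature` parameter, and the choice of KL(teacher || student) are assumptions made for this example, not details taken from these runs.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over one position's vocabulary logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_sum_exp for x in scaled]


def distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """Mean KL(teacher || student) over sequence positions.

    Hypothetical helper: both arguments are lists of per-position logit
    lists over a shared vocabulary. The objective is an assumption; the
    report does not specify the loss it used.
    """
    if not teacher_logits or len(teacher_logits) != len(student_logits):
        raise ValueError("expected two non-empty sequences of equal length")
    total = 0.0
    for t_row, s_row in zip(teacher_logits, student_logits):
        log_p = log_softmax(t_row, temperature)  # teacher distribution
        log_q = log_softmax(s_row, temperature)  # student distribution
        total += sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return total / len(teacher_logits)
```

Under this naming scheme, a run such as "350M to 125M" would pass the 350M model's logits as `teacher_logits` and the 125M model's logits as `student_logits`. The loss is zero when the two distributions match at every position and positive otherwise.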

Section 1



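[Figure: line chart plotted against Step (x-axis ticks at 1k, 10k, 100k; y-axis ticks from 2 to 5). Series: 125M to 350M, 350M to 125M, 350M, 350M to 350M, 760M to 125M, Large, 125M.]

Run sets: 125M (1), 350M (21), Large (35), 350M to 125M (1), 125M to 350M (1), 350M to 350M (1), 760M to 125M (1)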