Autoregressive Distillation
All models are EleutherAI GPT-NeoX models, loosely based on Megatron-LM and GPT-3. Models are named by their number of non-embedding parameters; a model named "X to Y" is a distillation of a teacher of size X into a student of size Y.
All models are trained on the Pile with rotary embeddings.
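As context for the "X to Y" runs listed below, here is a minimal sketch of a token-level distillation objective for autoregressive language models: a temperature-softened KL term between the teacher's and student's next-token distributions, mixed with the ordinary cross-entropy loss on the Pile tokens. The report does not specify the exact loss, temperature, or mixing weight used in these runs, and the function and parameter names (`distillation_loss`, `temperature`, `alpha`) are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 2.0, alpha: float = 0.5):
    """Illustrative token-level distillation loss (not necessarily the one used here).

    student_logits, teacher_logits: (batch, seq_len, vocab)
    labels: (batch, seq_len) token ids for the standard LM loss.
    """
    vocab = student_logits.size(-1)

    # Soft targets from the (frozen) teacher, softened by the temperature.
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1).view(-1, vocab)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1).view(-1, vocab)

    # KL(teacher || student), averaged per token; the T^2 factor keeps gradient
    # magnitudes comparable across temperatures.
    kd = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature ** 2

    # Hard-label term: standard next-token cross-entropy against the Pile tokens.
    ce = F.cross_entropy(student_logits.view(-1, vocab), labels.view(-1))

    return alpha * kd + (1.0 - alpha) * ce
```

In practice the teacher (e.g. the 350M or 760M model) runs in `torch.no_grad()` to produce `teacher_logits`, and only the student (e.g. the 125M model) receives gradients.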
Section 1

Run set          Runs
125M                1
350M               21
Large              35
350M to 125M        1
125M to 350M        1
350M to 350M        1
760M to 125M        1