Meta AI Released LLaMA
With up to 65 billion parameters, LLaMA is an open-source model family that is more performant than other LLMs at certain tasks while using much less computation.
LLaMA, or Large Language Model Meta AI, is a new family of open-source LLMs just introduced by Meta AI. The authors show that, trained on trillions of publicly available tokens, an LLM can outperform GPT-3 and other popular LLMs without coming close to their parameter counts. The models range from 7B to 65B parameters.
Training Data
Their training data was a mixture of many different datasets, adding up to a total of about 1.4T tokens.

- 67% CommonCrawl
- preprocess 5 CommonCrawl dumps
- deduplicate data at the line level
- language identification with a fastText linear classifier to filter out non-English pages
- filter out low-quality content with an n-gram language model (a minimal sketch of this kind of filtering follows the list below)
- 15% C4
- same preprocessing pipeline as above, except low-quality content is filtered with heuristics instead of an n-gram model
- 4.5% Github
- publicly available Github dataset on Google BigQuery
- filtered out low-quality files with heuristics based on line length and the proportion of alphanumeric characters, and removed headers and other boilerplate with regular expressions
- deduplicated data at the file level
- 4.5% Wikipedia
- Wikipedia dumps covering 20 different languages
- remove comments, boilerplate text, and links
- 4.5% Books
- the Gutenberg Project and the Books3 section of ThePile
- deduplicated at the book level, removing books with more than 90% content overlap
- 2.5% ArXiv
- process LaTeX files
- remove everything before the first section as well as the bibliography
- remove comments and inline-expand definitions and macros written by users
- 2.0% Stack Exchange
- dataset containing high quality question-answering data across many different fields
- kept data from the 28 largest sites and removed HTML tags
- sorted answers by score, from highest to lowest
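As a rough illustration of the CommonCrawl-style cleaning described above, here is a minimal Python sketch of line-level deduplication plus fastText language identification. This is not Meta's actual pipeline; the model file name and the confidence threshold here are assumptions used only for illustration.

```python
# Minimal sketch of line-level deduplication + fastText language ID.
# Assumptions: the public lid.176.bin language-ID model is available locally,
# and 0.65 is an arbitrary illustrative confidence threshold.
import hashlib
import fasttext  # pip install fasttext

lang_model = fasttext.load_model("lid.176.bin")  # pretrained language-ID model

def dedup_lines(lines, seen_hashes):
    """Keep only lines whose hash has not been seen before."""
    kept = []
    for line in lines:
        h = hashlib.sha1(line.strip().encode("utf-8")).hexdigest()
        if h not in seen_hashes:
            seen_hashes.add(h)
            kept.append(line)
    return kept

def is_english(text, threshold=0.65):
    """Return True if the classifier predicts English above the threshold."""
    labels, probs = lang_model.predict(text.replace("\n", " "), k=1)
    return labels[0] == "__label__en" and probs[0] >= threshold

seen = set()

def filter_page(page_text):
    """Deduplicate a page's lines, then keep the page only if it is English."""
    text = "\n".join(dedup_lines(page_text.splitlines(), seen))
    return text if is_english(text) else None
```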
Text was tokenized with byte-pair encoding (BPE), using the SentencePiece implementation.
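A toy example of training and using a BPE tokenizer with the SentencePiece library is shown below. The corpus file, vocabulary size, and model prefix are placeholders; LLaMA's actual tokenizer was trained on the full corpus with a 32k-token vocabulary.

```python
# Toy BPE tokenizer with SentencePiece. "corpus.txt" and the vocab size are
# placeholders, not LLaMA's real training setup.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.txt",        # hypothetical text file, one sentence per line
    model_prefix="toy_bpe",    # writes toy_bpe.model / toy_bpe.vocab
    vocab_size=2000,           # illustrative; must fit the corpus size
    model_type="bpe",
    byte_fallback=True,        # fall back to raw bytes for unknown characters
)

sp = spm.SentencePieceProcessor(model_file="toy_bpe.model")
ids = sp.encode("LLaMA tokenizes text with byte-pair encoding.", out_type=int)
print(ids)
print(sp.decode(ids))
```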
Architectural Changes
- Normalize the input of each transformer sub-layer with RMSNorm (GPT-3 inspired)
- Replaced ReLU activation functions with SwiGLU (PaLM inspired)
- Absolute positional embeddings replaced with rotary positional embeddings (RoPE) (GPTNeo inspired)
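To make the first two changes concrete, here is a minimal PyTorch sketch of RMSNorm pre-normalization and a SwiGLU feed-forward block. The dimensions and the hidden size are illustrative, not the exact LLaMA configuration.

```python
# Minimal sketch of RMSNorm pre-normalization and a SwiGLU feed-forward block.
# Sizes are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square layer norm: rescales by the RMS of the activations."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x * rms

class SwiGLU(nn.Module):
    """SwiGLU feed-forward block: silu(x W1) * (x W3), projected back by W2."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

# Pre-normalization: normalize the sub-layer input, then add the residual
x = torch.randn(2, 16, 512)           # (batch, sequence, model dim)
norm, ffn = RMSNorm(512), SwiGLU(512, 1376)
x = x + ffn(norm(x))                  # residual connection around the FFN
print(x.shape)                        # torch.Size([2, 16, 512])
```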
Training Specs
- 2048 A100 GPUs with 80GB VRAM
- They manually re-implemented the transformer backward function to speed up training and took advantage of model and sequence parallelism
- Trained with the AdamW optimizer and a cosine LR scheduler, with the final LR set to 10% of the max LR (a simplified setup is sketched below)
- weight decay=0.1, gradient clipping=1.0, and warmup of 2000 steps
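A simplified version of this optimization setup is sketched below. The peak learning rate, total step count, placeholder model, and dummy data are assumptions; the AdamW betas come from the paper, and the weight decay, gradient clipping, and warmup follow the values above.

```python
# Sketch of AdamW + cosine LR schedule with warmup, decaying to 10% of max LR.
# max_lr, total_steps, and the tiny model are placeholders.
import math
import torch

model = torch.nn.Linear(512, 512)              # placeholder model
max_lr, total_steps, warmup_steps = 3e-4, 100_000, 2_000

optimizer = torch.optim.AdamW(
    model.parameters(), lr=max_lr, betas=(0.9, 0.95), weight_decay=0.1
)

def lr_lambda(step):
    # Linear warmup, then cosine decay from 1.0 down to 0.1 of the max LR
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.1 + 0.9 * 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# One illustrative training step on dummy data
x = torch.randn(8, 512)
loss = model(x).pow(2).mean()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clip to 1.0
optimizer.step()
scheduler.step()
optimizer.zero_grad()
```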

Results
More results can be found in their paper. The gist is that LLaMA beats the current reigning models on certain tasks and shows strong performance across a diverse set of benchmarks. In particular, the LLaMA models outperform GPT-3, Gopher, and Chinchilla on many benchmarks while being much smaller.
References
“Introducing LLaMA: A Foundational, 65-Billion-Parameter Language Model.” Meta AI Blog, Meta AI, 24 Feb. 2023.