Meta AI Released LLaMA
With up to 65 billion parameters, LLaMA is an open-source model family that is more performant than other LLMs at certain tasks while using much less computation.
LLaMA, or Large Language Model Meta AI, is a new family of open-source LLMs just introduced by Meta AI. The authors show that, trained on trillions of publicly available tokens, an LLM can outperform GPT-3 and other popular LLMs without coming close to their parameter counts. The models range from 7B to 65B parameters.
Training Data
Their training data was a mixture of many different datasets, adding up to a total of about 1.4T tokens.

- 67% CommonCrawl
- preprocess 5 CommonCrawl dumps
- deduplicate data at the line level
- language identification with a fastText linear classifier to filter out non-English pages
- filter out low-quality content with an n-gram language model (a minimal sketch of this kind of filtering follows the list below)
- 15% C4
- same preprocessing pipeline as above, except low-quality content is filtered with heuristics instead of an n-gram model
- 4.5% Github
- publicly available Github dataset on Google BigQuery
- filtered out low-quality files with heuristics based on line length and the proportion of alphanumeric characters, and removed headers and other boilerplate with regular expressions
- deduplicated data at the file level
- 4.5% Wikipedia
- Wikipedia dumps covering 20 different languages
- remove comments, boilerplate text, and links
- 4.5% Books
- the Gutenberg Project and the Books3 section of ThePile
- deduplicated at the book level, removing books with more than 90% content overlap
- 2.5% ArXiv
- process LaTeX files
- remove everything before the first section as well as the bibliography
- remove comments and inline-expand definitions and macros written by users
- 2.0% Stack Exchange
- dataset containing high quality question-answering data across many different fields
- kept data from the 28 largest sites and removed HTML tags
- sorted answers by score, from highest to lowest
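As a rough illustration of the CommonCrawl-style cleaning described above, here is a minimal Python sketch of line-level deduplication plus fastText language identification. This is not Meta's actual pipeline; the model file name and the confidence threshold here are assumptions used only for illustration.

```python
# Minimal sketch of line-level deduplication + fastText language ID.
# Assumptions: the public lid.176.bin language-ID model is available locally,
# and 0.65 is an arbitrary illustrative confidence threshold.
import hashlib
import fasttext  # pip install fasttext

lang_model = fasttext.load_model("lid.176.bin")  # pretrained language-ID model

def dedup_lines(lines, seen_hashes):
    """Keep only lines whose hash has not been seen before."""
    kept = []
    for line in lines:
        h = hashlib.sha1(line.strip().encode("utf-8")).hexdigest()
        if h not in seen_hashes:
            seen_hashes.add(h)
            kept.append(line)
    return kept

def is_english(text, threshold=0.65):
    """Return True if the classifier predicts English above the threshold."""
    labels, probs = lang_model.predict(text.replace("\n", " "), k=1)
    return labels[0] == "__label__en" and probs[0] >= threshold

seen = set()

def filter_page(page_text):
    """Deduplicate a page's lines, then keep the page only if it is English."""
    text = "\n".join(dedup_lines(page_text.splitlines(), seen))
    return text if is_english(text) else None
```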
Text was tokenized with byte-pair encoding (BPE), using the SentencePiece implementation.
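A toy example of training and using a BPE tokenizer with the SentencePiece library is shown below. The corpus file, vocabulary size, and model prefix are placeholders; LLaMA's actual tokenizer was trained on the full corpus with a 32k-token vocabulary.

```python
# Toy BPE tokenizer with SentencePiece. "corpus.txt" and the vocab size are
# placeholders, not LLaMA's real training setup.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.txt",        # hypothetical text file, one sentence per line
    model_prefix="toy_bpe",    # writes toy_bpe.model / toy_bpe.vocab
    vocab_size=2000,           # illustrative; must fit the corpus size
    model_type="bpe",
    byte_fallback=True,        # fall back to raw bytes for unknown characters
)

sp = spm.SentencePieceProcessor(model_file="toy_bpe.model")
ids = sp.encode("LLaMA tokenizes text with byte-pair encoding.", out_type=int)
print(ids)
print(sp.decode(ids))
```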
Architectural Changes
- Normalize the input of each transformer sub-layer with RMSNorm (GPT-3 inspired)
- Replaced ReLU activation functions with SwiGLU (PaLM inspired)
- Absolute positional embeddings replaced with rotary positional embeddings (RoPE) (GPTNeo inspired)
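To make the first two changes concrete, here is a minimal PyTorch sketch of RMSNorm pre-normalization and a SwiGLU feed-forward block. The dimensions and the hidden size are illustrative, not the exact LLaMA configuration.

```python
# Minimal sketch of RMSNorm pre-normalization and a SwiGLU feed-forward block.
# Sizes are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square layer norm: rescales by the RMS of the activations."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x * rms

class SwiGLU(nn.Module):
    """SwiGLU feed-forward block: silu(x W1) * (x W3), projected back by W2."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

# Pre-normalization: normalize the sub-layer input, then add the residual
x = torch.randn(2, 16, 512)           # (batch, sequence, model dim)
norm, ffn = RMSNorm(512), SwiGLU(512, 1376)
x = x + ffn(norm(x))                  # residual connection around the FFN
print(x.shape)                        # torch.Size([2, 16, 512])
```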
Training Specs
- 2048 A100 GPUs with 80GB VRAM
- They manually re-implemented the transformer backward function to speed up training and took advantage of model and sequence parallelism
- Trained with the AdamW optimizer and a cosine LR scheduler, with the final LR set to 10% of the max LR (a simplified setup is sketched below)
- weight decay=0.1, gradient clipping=1.0, and warmup of 2000 steps
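A simplified version of this optimization setup is sketched below. The peak learning rate, total step count, placeholder model, and dummy data are assumptions; the AdamW betas come from the paper, and the weight decay, gradient clipping, and warmup follow the values above.

```python
# Sketch of AdamW + cosine LR schedule with warmup, decaying to 10% of max LR.
# max_lr, total_steps, and the tiny model are placeholders.
import math
import torch

model = torch.nn.Linear(512, 512)              # placeholder model
max_lr, total_steps, warmup_steps = 3e-4, 100_000, 2_000

optimizer = torch.optim.AdamW(
    model.parameters(), lr=max_lr, betas=(0.9, 0.95), weight_decay=0.1
)

def lr_lambda(step):
    # Linear warmup, then cosine decay from 1.0 down to 0.1 of the max LR
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.1 + 0.9 * 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# One illustrative training step on dummy data
x = torch.randn(8, 512)
loss = model(x).pow(2).mean()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clip to 1.0
optimizer.step()
scheduler.step()
optimizer.zero_grad()
```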

Results
More results can be found in their paper. The gist is that LLaMA beats the current reigning models on certain tasks and shows strong performance across a diverse set of benchmarks. In particular, the LLaMA models outperform GPT-3, Gopher, and Chinchilla on many benchmarks while being much smaller.
References
“Introducing LLaMA: A Foundational, 65-Billion-Parameter Language Model.” Meta AI Blog, Meta AI, 24 Feb. 2023.