
A Survey of Large Language Models

In this article, I summarize the findings in the March 31, 2023 paper "A Survey of Large Language Models".
Created on April 3 | Last edited on April 6

In this article, I'll be outlining the findings in the recent and important paper, "A Survey of Large Language Models".
I'll be covering:
  • an overview of LLMs and their emergent abilities
  • resources for working with LLMs
  • pre-training (data, architecture, and training techniques)
  • adaptation tuning (instruction tuning and alignment tuning)
  • utilization (in-context learning and chain-of-thought prompting)
  • capacity evaluation

Let's jump in ...

Intro

Language modeling (LM), the task of predicting the next word in a sequence, can be divided into 4 developmental stages:
  1. Statistical Language Modeling (SLM): methods from the 1990s where a simple n-gram model predicts the next word based only on recent context (the Markov assumption); see the small sketch after this list
  2. Neural Language Models (NLM): use neural networks like RNNs, LSTMs, GRUs, word2vec
  3. Pretrained Language Models (PLM): ELMo, BERT, BART, GPT-2
  4. Large Language Models (LLM): larger PLMs like GPT-4, ChatGPT, PaLM, Sparrow, Claude, Microsoft 365's AI, etc
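To make the Markov assumption behind SLMs concrete, here's a minimal bigram-model sketch in Python (the toy corpus and function names are my own, not from the paper):

```python
from collections import Counter, defaultdict

# First-order Markov assumption: P(w_t | w_1..w_{t-1}) ≈ P(w_t | w_{t-1}),
# so bigram counts are all we need to predict the next word.
corpus = "the cat sat on the mat and the cat ate".split()
bigram_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1

def predict_next(word):
    """Return the most likely next word given only the previous word."""
    return bigram_counts[word].most_common(1)[0][0] if word in bigram_counts else None

print(predict_next("the"))  # -> "cat"
```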

Characteristics

The paper describes 3 characteristics of LLMs that differentiate them from PLMs:
  • surprising emergent abilities
  • LLMs revolutionize the way we develop and use AI algorithms
  • development of LLM draws no clear distinction between research and engineering
In other words: because of these newfound capabilities, LLMs have become relevant not just in research but also in industry and applications, demanding that the skills of engineers and researchers intermingle.

Uncertainties

There still exist open questions about:
  • why these emergent abilities occur
  • the difficulty of training LLMs in academic research, since they are primarily trained in industry
  • the alignment problem (aligning LLM behavior with human values)
This paper covers 4 themes:
  1. how to pre-train,
  2. adaptation tuning for effectiveness and safety,
  3. how to use LLMs for downstream tasks (utilization), and
  4. evaluation (how to evaluate LLMs).

Overview

Emergent Abilities

  • In-Context Learning (ICL): demonstrate a task a few times in the prompt and ask the model to repeat that demonstration for a new example (see the small prompt sketch after this list)
  • Instruction Following: without explicit examples of an unseen task, an LLM can perform well on it given a proper natural-language instruction
  • Step-by-Step Reasoning: LLMs can show the intermediate steps of how they "reason" their way to an answer
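To make ICL concrete, here's a minimal example of what a few-shot prompt might look like (the task and demonstrations are made up for illustration):

```python
# A few demonstrations followed by the new example the model should complete.
# No weights are updated; the model "learns" the task purely from the prompt.
prompt = """Review: The food was wonderful.
Sentiment: positive

Review: Service was slow and the room was cold.
Sentiment: negative

Review: I would happily come back again.
Sentiment:"""
# Sending `prompt` to an LLM, we expect it to continue with "positive".
```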

Key Techniques

These are a subset of many techniques that led to the success of LLMs.
  • Scaling: scaling the size of the model (GPT-3 sits at 175B and PaLM at 540B!)
  • Training: effectively training an LLM requires a lot (e.g. DeepSpeed and Megatron-LM for parallelization, restarting after loss spike for stable training, etc)
  • Alignment Tuning: Reinforcement Learning from Human Feedback (RLHF), as used in InstructGPT, is one example of tuning the model's alignment w.r.t. human values
  • Tool Manipulation: plug-ins for ChatGPT and GPT-4 give it a much wider range of application

LLM Resources

This section is for those who want to dabble with LLMs; there are lots of factors to consider (primarily hardware).

Running Your Own Models

There is no strict parameter-count cut-off for what qualifies as an LLM; the threshold is usually placed around 10B parameters.
For models with tens of billions of parameters, consider mT5, T0, GPT-NeoX-20B, CodeGen, UL2, Flan-T5, mT0, PanGu-α, and LLaMA.
For models with hundreds of billions of parameters, consider OPT, BLOOM, and BLOOMZ.
It's important to estimate the FLOPs (floating point operations) these models require in order to gauge how much compute is needed to train and run them.
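As a rough, commonly used rule of thumb (an approximation I'm adding here, not a number from the survey), training compute is about 6 FLOPs per parameter per token:

```python
def approx_training_flops(n_params, n_tokens):
    """Back-of-the-envelope estimate of total training compute:
    ~6 floating point operations per parameter per token
    (actual cost depends on architecture, sequence length, and hardware efficiency)."""
    return 6 * n_params * n_tokens

# e.g. a 20B-parameter model trained on 400B tokens
print(f"{approx_training_flops(20e9, 400e9):.2e}")  # ~4.80e+22 FLOPs
```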

Accessing via APIs

OpenAI has released the ChatGPT API, and they also provide a web-based interface for interacting with their GPT models. Hugging Face also has a model hub from which you can download model weights, use their inference endpoints, or even query models directly in the browser.
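As a minimal sketch of the API route (using the openai Python package; the exact interface differs between package versions, and the model name is just an example):

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY to be set in the environment
response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # example model name; use whichever chat model you have access to
    messages=[{"role": "user", "content": "Explain in-context learning in one sentence."}],
)
print(response.choices[0].message.content)
```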

Common Datasets

We can separate the most popular datasets into these categories:
  • Books
    • BookCorpus
    • Project Gutenberg
    • Books1 and Books2
  • CommonCrawl
    • C4 (and its variants)
    • CC-Stories
    • CC-News
    • RealNews
  • Reddit Links
    • WebText and/or OpenWebText
    • PushShift.io
  • Wikipedia
  • Code
    • GitHub
    • Stack Overflow
    • Google BigQuery dataset
  • Others
    • The Pile
    • ROOTS

Library Resources

  • Hugging Face Transformers lets you easily access, train, and use hundreds of models (see the short example after this list)
  • DeepSpeed, developed by Microsoft, is a PyTorch-based deep learning optimization library
  • Megatron-LM developed by NVIDIA provides an array of optimization techniques for training LLMs
  • JAX is a relatively new ML library developed by Google Brain
  • Colossal-AI, developed by HPC-AI Tech, is for training LLMs
  • The paper lists a few more!
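For instance, a few lines with Hugging Face Transformers are enough to download a model and generate text (the model name here is just a small example):

```python
from transformers import pipeline

# Downloads the weights from the Hugging Face Hub on first use.
generator = pipeline("text-generation", model="gpt2")
print(generator("Large language models are", max_new_tokens=20)[0]["generated_text"])
```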

Pre-Training

This section answers the question: how do we effectively pretrain? This section covers data collection and processing, model architecture, and training techniques.
(Figure: the data pipeline.)

Data Collection

Data sources can be divided into 2 categories: general data and specialized data. Webpages, conversation texts, and books fall under the general data umbrella, and scientific text, code, and multilingual text fall under specialized data.

Data Preprocessing

There are 2 general classes of approaches here: classifier-based and heuristic-based. Classifier-based methods train a binary classifier on "good" data like Wikipedia, whose job is then to classify whether new data is good or not; this approach is prone to filtering out high-quality specialized data. BLOOM and Gopher instead employ heuristic methods:
  • language filtering: filter out languages that aren't expected to be prevalent for the LLM's task
  • metric filtering: evaluation metrics about text can be used to remove unnatural text
  • statistic filtering: engineered statistical features of the text, like its punctuation distribution, can be used to filter (a simplified sketch follows this list)
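A toy illustration of what such heuristic filters can look like (the thresholds are invented for illustration, not taken from BLOOM or Gopher):

```python
def keep_document(text, min_words=50, max_symbol_ratio=0.1):
    """Simplified heuristic filter: drop documents that are too short or that
    contain too many non-alphanumeric symbols (thresholds are illustrative only)."""
    if len(text.split()) < min_words:
        return False
    symbols = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
    return symbols / max(len(text), 1) <= max_symbol_ratio
```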
Deduplication, the process of removing duplicated content from the data, is also very important. It can happen at the sentence, document, and dataset levels: standard practice is to remove repetitive sentences and phrases, drop documents whose overlap (by some similarity metric) with other documents is too high, and deduplicate across dataset splits, i.e. between the training and validation sets.
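A minimal sketch of document-level exact deduplication via hashing (real pipelines typically also remove near-duplicates, e.g. by n-gram overlap or MinHash):

```python
import hashlib

def deduplicate(documents):
    """Drop exact duplicate documents by hashing whitespace-normalized, lowercased text."""
    seen, unique = set(), []
    for doc in documents:
        digest = hashlib.sha256(" ".join(doc.lower().split()).encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique
```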
Privacy redaction is the process of removing personally identifiable information (PII) from the dataset. One effective approach is to use rule-based methods that can identify addresses, names, and other personal details.
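A tiny rule-based redaction sketch (these patterns are deliberately simplified; production systems use far more thorough rules and entity recognizers):

```python
import re

# Simplified rule-based PII redaction: mask email addresses and phone-number-like strings.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact(text):
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)

print(redact("Contact jane.doe@example.com or +1 555 123 4567."))
```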
Tokenization also matters a great deal for the LLM's task. Though it's convenient to reuse an existing tokenizer, it's often best to train one customized to your own corpus (see the sketch below).
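For example, training a small BPE tokenizer on your own corpus with the Hugging Face tokenizers library might look like this (the file name and vocabulary size are placeholders):

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Train a byte-pair-encoding tokenizer on a domain-specific corpus ("corpus.txt" is a placeholder).
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=32000, special_tokens=["[UNK]", "[PAD]", "[BOS]", "[EOS]"])
tokenizer.train(files=["corpus.txt"], trainer=trainer)
print(tokenizer.encode("A customized tokenizer for your task").tokens)
```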
It's also important to consider the distribution and quality of your training data: duplicated or poorly balanced data will hurt downstream tasks.

Architecture

LLMs use the Transformer architecture as the default and generally fall into 3 architectural categories: encoder-decoder (T5), causal decoder (GPT-3), and prefix decoder (PaLM).
Encoder-decoder models have an encoder with multi-head self-attention layers that produces latent representations, and a decoder that attends to those representations via cross-attention and autoregressively predicts the output.
Causal decoder models use only the decoder, with masked attention layers. Most LLMs nowadays are causal decoder models.
Prefix decoder models are the same as causal decoder models except they use bidirectional (unmasked) attention over the prefix and masked attention over the generated output (see the mask sketch below).
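To make the attention-pattern difference concrete, here's a small sketch of causal vs. prefix-decoder masks (the helper names and shapes are my own):

```python
import numpy as np

def causal_mask(seq_len):
    """Causal decoder: each position attends only to itself and earlier positions."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def prefix_mask(seq_len, prefix_len):
    """Prefix decoder: bidirectional attention within the prefix, causal attention afterwards."""
    mask = causal_mask(seq_len)
    mask[:prefix_len, :prefix_len] = True  # prefix tokens can all see each other
    return mask

print(prefix_mask(5, 2).astype(int))
```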
The configuration of the layers within Transformer blocks is also key. Extensive studies have tested where normalization (layer norm specifically) should go, which normalization variants to use (RMSNorm, DeepNorm), which activation functions to use (GeLU being the most common, along with GLU variants like SwiGLU and GeGLU), and which position embedding scheme to use (rotary position embeddings, RoPE, being the most widely adopted). They summarize the following recommended set of configurations:
  • pre-RMS Norm
  • SwiGLU or GeGLU
  • RoPE or ALiBi for position embeddings
The paper also includes a table summarizing the specs of popular LLMs.

Model training is typically done
  • with a large batch size (e.g. 1.6M tokens; GPT-3 gradually scaled its batch size from 32K to 3.2M tokens)
  • with a linear warm-up schedule for the first 0.1-0.5% of training up to a max LR of 5×10⁻⁵ to 1×10⁻⁴, followed by cosine decay that gradually reduces the LR to 10% of its max value (see the schedule sketch after this list)
  • Adam or AdamW are standard optimizers (I've also heard of Amos and LION)
  • mixed precision, dropout, weight decay, gradient clipping
  • 3D parallelism is a combination of data, pipeline, and tensor parallelism (splitting individual tensors across multiple devices); check out DeepSpeed, Colossal-AI, and Alpa
  • for faster inference, use quantization if applicable
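A minimal sketch of the warm-up-then-cosine-decay schedule described above (the warm-up fraction and peak LR are illustrative values):

```python
import math

def learning_rate(step, total_steps, max_lr=1e-4, warmup_frac=0.003, final_frac=0.1):
    """Linear warm-up for a small fraction of training, then cosine decay
    down to 10% of the peak learning rate."""
    warmup_steps = max(int(total_steps * warmup_frac), 1)
    min_lr = max_lr * final_frac
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
```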

Adaptation Tuning

Adaptation tuning refers to tuning the model after pre-training (and even after task-specific fine-tuning) to better tailor it to a specific need. This paper considers 2 kinds: instruction tuning and alignment tuning.
Instruction tuning is a fine-tuning paradigm that aims to tailor the LLM to understand and generate output based on user-provided instructions.

Early instruction datasets were built by taking existing NLP datasets and converting them into instruction format. This produced many similar, formulaic instruction examples, which led to poor performance. InstructGPT instead proposed using real user queries as a diverse set of instructions, with human labelers writing additional queries and providing demonstration answers for the model.
Scaling the number of instructions, as you might expect, improves performance. What's also crucial is the diversity of the instruction set and the format of each instance (Fig. 4a in the paper). Much like the actual training data, a diverse, good-quality set of instances is required. Some work, like Galactica, has mixed instruction tuning into pre-training. Instruction tuning has proved to be key to unlocking LLMs' emergent abilities.
Alignment tuning is tuning the model so that it aligns with human values, since LLMs are prone to many errors (e.g. biases, harmful output, inaccurate responses). Alignment tuning sometimes reduces the model's general performance; this is called the alignment tax.
Alignment tuning follows the principles of infusing models with a sense of helpfulness, honesty, and harmlessness.
To instill these principles into the model, researchers have traditionally adversarially tested the model and corrected it. Human labeling has also been one popular avenue for enforcing the 3 aforementioned principles.
In essence, getting quality labels that align with human values is difficult, and it's most often the case that labeled responses are ranked, with rule-based methods used to select the best ones.
The most popular alignment tuning mechanism is Reinforcement Learning from Human Feedback (RLHF).
RLHF consists of 3 components:
  • pre-trained LM (think GPT-3 for InstructGPT)
  • reward model (think a trained LM or an LM trained specifically on human preference data)
  • RL algorithm (PPO)

The RLHF process is broken down into 3 stages:
  1. Supervised fine-tuning: the LM is fine-tuned on human-written demonstrations of the desired behavior.
  2. Reward model training: human labelers rank candidate model outputs, and a reward model is trained on this preference data.
  3. RL fine-tuning: PPO is run, where the states are the currently generated tokens, the actions are all the words in the model's vocabulary, and the reward comes from the model trained in stage 2. InstructGPT included a penalty for divergence, ensuring that the aligned model won't deviate too far from the original version (a rough sketch of this penalized reward follows this list).
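The shape of that penalized reward, as a rough sketch (the KL coefficient and tensor shapes are illustrative, not InstructGPT's exact formulation):

```python
import torch

def penalized_reward(rm_score, policy_logprobs, ref_logprobs, kl_coef=0.02):
    """RL-stage reward (sketch): the reward model's score minus a KL-style penalty
    that discourages the tuned policy from drifting too far from the original LM.
    All arguments are torch tensors; kl_coef is an illustrative value."""
    kl_term = (policy_logprobs - ref_logprobs).sum(dim=-1)  # per-sequence divergence estimate
    return rm_score - kl_coef * kl_term
```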
For more details on RLHF, check out the W&B reports in the References section below!

Utilization

After pre-training and adaptation tuning, there are further ways to get better results from an LLM through prompting, such as in-context learning and chain-of-thought prompting.
In-context learning (ICL) prepends a set of demonstrations to the test instruction so the model can leverage that context to make a prediction on the new test instance. This emergent ability is seen only in LLMs, and the selection of the demonstrations is crucial for performance. Refer to the Adaptation Tuning section above.
The set of demonstrations used in ICL, much like in instruction tuning, is also very critical. There are heuristic-based and LLM-based methods for selecting the best set of demonstrations.
Heuristic approaches use k-NN retrieval to select examples semantically similar to the test query (a minimal sketch follows below), while LLM-based approaches have a model predict the informativeness of candidate demonstrations.
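A minimal sketch of that k-NN style selection (embeddings are assumed to come from any sentence encoder of your choice; the names are my own):

```python
import numpy as np

def select_demonstrations(query_vec, candidate_vecs, candidates, k=4):
    """Pick the k candidate demonstrations whose embeddings are most similar
    (by cosine similarity) to the test query's embedding."""
    q = query_vec / np.linalg.norm(query_vec)
    c = candidate_vecs / np.linalg.norm(candidate_vecs, axis=1, keepdims=True)
    top = np.argsort(-(c @ q))[:k]
    return [candidates[i] for i in top]
```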
After selecting the examples, they must be organized into a natural language prompt format. There are many ways to do this: instantiate a pre-defined template, use Auto-CoT, use an LLM to generate task descriptions that improve the demonstrations, etc. In fact, even the order of the demonstrations matters! Ordering is typically determined via heuristics. Pre-training diversity and document length may contribute to this ICL ability.
Chain of Thought (CoT) prompting can be used with ICL to improve performance. Diverse and complex CoT instructions tend to work best. Note that CoT mainly improves performance in cases where step-by-step reasoning is crucial.

Capacity Evaluation

To evaluate the ever-growing effectiveness of LLMs, a flurry of benchmarks and tasks have been developed. The paper divides these into language generation, knowledge utilization, and complex reasoning.
In language generation, there exist datasets like The Pile, but more challenging benchmarks have also been created, like the LAMBADA dataset, where the LLM must predict the last word of a paragraph. Conditional language generation tasks like QA, summarization, and the like have also seen immense performance boosts from LLMs (sometimes even surpassing human annotators). There are some evident limitations, however: LLMs have issues with long passage generation and contexts where global planning is important.
LLMs perform quite well in open- and closed-book QA settings (knowledge utilization). The main issue here is hallucination. Intrinsic hallucination is generated information that conflicts with the existing source; extrinsic hallucination is generated information that cannot be verified. There is also the issue of how up-to-date the training data is.
As for complex reasoning, LLMs still fall behind humans, though they are catching up quite fast! LLMs struggle with being consistent, especially in complex step-by-step reasoning tasks. They also struggle with numerical computation.

Conclusion

There are a few more sections about evaluation and future directions if you're interested, but this wraps up this gigantic survey paper! The paper was well-timed, especially in this current golden era of LLMs and AGI talk. It took a few days to dissect the paper and write this up! I hope you learned a bit, as I certainly did.
Check out the references for some cool articles from W&B about LLMs!
Thanks for reading! 👋

Tags: ML News