A Survey on In-Context Learning: The Paper
A bird's-eye view of in-context learning in LLMs.
In this blog post, I'll be covering A Survey on In-context Learning, a paper from the MOE Key Lab of Computational Linguistics, Shanghai AI Lab, and the University of California. We'll look at what in-context learning is, how models are warmed up for it, how demonstrations are designed and scored, and where the field is headed.
What is In-Context Learning?
In-Context Learning (ICL) is the practice of prepending your prompt with a set of demonstrations/examples so that the model can apply what it learns from them to a test example, all on the fly: the model's weights aren't changed.
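To make that concrete, here's a minimal sketch of an ICL prompt for sentiment classification. The demonstrations and template are made up for illustration; the resulting string can be sent to any completion-style LLM.

```python
# A minimal in-context learning prompt: k demonstrations followed by the
# test input. The model's weights never change; it picks up the task
# purely from the pattern in the prompt.
demonstrations = [
    ("The movie was a delight from start to finish.", "positive"),
    ("I walked out halfway through.", "negative"),
]
test_input = "The plot dragged, but the acting was superb."

prompt = ""
for text, label in demonstrations:
    prompt += f"Review: {text}\nSentiment: {label}\n\n"
prompt += f"Review: {test_input}\nSentiment:"

print(prompt)  # feed this to any LLM completion endpoint
```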
In-context learning offers a couple of unique advantages:
- a training-free framework
- an interpretable interface (the examples you provide are all in natural language)
- conceptually similar to how humans learn by analogy

It's been demonstrated that in-context learning works well with pretrained LLMs but is sensitive to prompt templates, wording, and the choice and ordering of examples. A warmup stage between pretraining and ICL inference, in which you tune the model's parameters slightly or add new ones, aims to enhance its ICL ability.
Performing In-Context Training

Supervised In-Context Learning: constructing in-context training data and multitask training
MetaICL is a meta-training framework that, in short, continues tuning a model after pretraining so that, come inference time, it can pick up tasks more easily. Another paper uses symbol tuning, a method where the labels of a classification task, such as positive/negative, are replaced with arbitrary symbols, such as foo/bar, in the in-context input-label pairs. Because the symbols carry no semantics, the model must learn the task from the in-context input-label mappings themselves. Instruction tuning, a supervised learning approach, likewise enhances the model's ICL ability.
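Here's a rough sketch of how a symbol-tuned prompt might be constructed, assuming a simple input/label template; the examples and the foo/bar mapping are illustrative, not the paper's actual setup.

```python
# Symbol tuning (simplified sketch): replace natural-language labels with
# arbitrary symbols so the model can't lean on label semantics and must
# infer the task from the in-context input-label mapping.
examples = [
    ("An absolute masterpiece.", "positive"),
    ("A waste of two hours.", "negative"),
    ("I'd watch it again tomorrow.", "positive"),
]

symbols = {"positive": "foo", "negative": "bar"}  # arbitrary stand-ins

prompt = ""
for text, label in examples:
    prompt += f"Input: {text}\nLabel: {symbols[label]}\n\n"
prompt += "Input: Thoroughly boring.\nLabel:"
print(prompt)  # the correct continuation here is 'bar'
```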
Self-supervised learning methods also exist to enhance an LLM's in-context learning ability. The key idea both paradigms share is bridging the gap between pretraining and leveraging ICL at inference: both supervised and self-supervised warmup enhance the model's ability to learn from context, hinting at the improvements still available after pretraining. The authors accordingly suggest including a warmup stage between pretraining and inference. In both paradigms, performance plateaus as warmup data grows, indicating that models only need a small amount of data to be warmed up.
Demonstration Design
The Demonstration Design section of the paper covers how demonstrations/examples for in-context learning should be designed (e.g., their order, type, content, format, and number). The authors divide this section into two: demonstration organization and demonstration selection.
Many unsupervised methods have been proposed, including:
- KATE, a k-NN-based method that selects demonstrations whose sentence embeddings are closest to the test input (see the sketch below); other work uses LMs to generate demonstrations directly.
- EPR, which uses a two-stage retrieval method followed by a scoring LM: candidate demonstrations are labeled as positive examples if they yield a high score when concatenated with the input and fed into the scoring LM.
For example ordering, GlobalE & LocalE are two entropy metrics used in combination to score a given permutation of prompt examples. APE (Automatic Prompt Engineer) is a method for automatically generating prompts for a given task. And of course, Chain of Thought (CoT) is used during inference to encourage the model to reason through intermediate steps before arriving at a final answer. More prompt engineering methods are in my other article!
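Here's a minimal KATE-style selection sketch. The sentence-transformers model name, the candidate pool, and k are my assumptions; the real method has more machinery around it.

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

# KATE-style selection (simplified): embed the candidate pool and the test
# input, then pick the k nearest candidates by cosine similarity.
# 'all-MiniLM-L6-v2' is just a convenient small model, not what the paper used.
model = SentenceTransformer("all-MiniLM-L6-v2")

pool = [
    "The service was slow but the food was great.",
    "Terrible battery life on this laptop.",
    "A gripping novel I couldn't put down.",
    "The hotel room was spotless and quiet.",
]
query = "The restaurant's dishes were delicious."

pool_emb = model.encode(pool, normalize_embeddings=True)
query_emb = model.encode([query], normalize_embeddings=True)[0]

k = 2
scores = pool_emb @ query_emb          # cosine similarity (embeddings are normalized)
top_k = np.argsort(-scores)[:k]
demonstrations = [pool[i] for i in top_k]
print(demonstrations)
```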
Self-ask is a method for prompting the model to generate follow-up questions based on the input, answer them, and feed the answers back in as context before producing a final answer.
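Below is an illustrative self-ask-style prompt; I've adapted the template from memory of the original paper's format, so treat the exact wording as an assumption.

```python
# A self-ask-style prompt: the demonstration shows the model how to
# decompose a question into follow-ups, answer each one, and only then
# commit to a final answer.
prompt = """Question: Who lived longer, Theodor Haecker or Harry Vaughan Watkins?
Are follow up questions needed here: Yes.
Follow up: How old was Theodor Haecker when he died?
Intermediate answer: Theodor Haecker was 65 years old when he died.
Follow up: How old was Harry Vaughan Watkins when he died?
Intermediate answer: Harry Vaughan Watkins was 69 years old when he died.
So the final answer is: Harry Vaughan Watkins

Question: {question}
Are follow up questions needed here:"""

print(prompt.format(question="Who was president of the U.S. when superconductivity was discovered?"))
```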

The authors have 5 key takeaways:
- demonstration selection work has mostly been instance-level (examples are chosen separately for each test input), while corpus-level selection is underexplored.
- output score/probability distribution of LLMs plays a key role in selecting instances
- the permutation space of k examples has size k!, so it's not feasible to consider every ordering (see the sketch after this list).
- CoT improves model reasoning, and work in understanding how to improve CoT prompting ability is a promising avenue.
- LLMs can be used to generate examples, removing the need for human labor in writing templates.
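To make the k! takeaway concrete, here's a toy, GlobalE-flavored ordering sketch. `predict_label` is a hypothetical stand-in for running the LM on a probe set, and the whole thing is a heavy simplification of the actual method.

```python
from collections import Counter
from itertools import permutations
from math import log

def entropy(probs):
    return -sum(p * log(p) for p in probs if p > 0)

def best_ordering(demos, predict_label, probe_inputs):
    """Pick the demonstration ordering whose predicted-label distribution
    over a probe set has the highest entropy (GlobalE's intuition: a good
    ordering shouldn't collapse every prediction onto one label).
    predict_label(ordering, x) is a hypothetical stand-in for the LM call."""
    best, best_score = None, float("-inf")
    for ordering in permutations(demos):  # k! orderings: toy-scale only
        counts = Counter(predict_label(ordering, x) for x in probe_inputs)
        total = sum(counts.values())
        score = entropy([c / total for c in counts.values()])
        if score > best_score:
            best, best_score = ordering, score
    return best

# Toy usage with a fake "model" that just echoes the first demo's label:
demos = [("great film", "positive"), ("dull plot", "negative")]
probe = ["fine movie", "awful pacing"]
print(best_ordering(demos, lambda ordering, x: ordering[0][1], probe))
```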
Scoring Function
To understand this section, we need to cover 2 equations.
The first equation defines the scoring function:

$$P(y_j \mid x) \triangleq f_{\mathcal{M}}(y_j, C, x)$$

It states: the probability of a candidate output $y_j$ (a class label or open text) for an input $x$ is, by definition, equal to a scoring function $f$ over a pretrained language model $\mathcal{M}$, the demonstration context $C$ (an optional instruction followed by input-output pairs/examples written in natural language), and the input $x$.

The second equation defines the prediction:

$$\hat{y} = \arg\max_{y_j \in Y} P(y_j \mid x)$$

It states: $\hat{y}$ is equal to the candidate $y_j$ (in the pool of candidate outputs $Y$) that maximizes the scoring function above.

Basically, $f_{\mathcal{M}}(y_j, C, x)$ estimates how likely the output $y_j$ is given the context $C$ and the input $x$, and $\hat{y}$ is the most likely output. You can see why having this capability might help quantify how well our model is performing in in-context learning!
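Here's a rough sketch of this argmax-over-candidates idea using Hugging Face transformers. Scoring a candidate by the sum of its token log-probabilities, and using gpt2, are my choices for illustration, not the paper's exact setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Score each candidate y_j by the log-probability the LM assigns to it when
# appended to the context C and input x, then take the argmax.
tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")
lm.eval()

def score(context_and_input: str, candidate: str) -> float:
    """Sum of log P(token) over the candidate's tokens given the prompt."""
    prompt_ids = tok(context_and_input, return_tensors="pt").input_ids
    cand_ids = tok(candidate, return_tensors="pt").input_ids
    ids = torch.cat([prompt_ids, cand_ids], dim=1)
    with torch.no_grad():
        log_probs = lm(ids).logits.log_softmax(dim=-1)
    n_prompt = prompt_ids.shape[1]
    total = 0.0
    for i in range(cand_ids.shape[1]):
        # logits at position t predict the token at position t+1,
        # so the i-th candidate token is predicted at n_prompt + i - 1
        total += log_probs[0, n_prompt + i - 1, cand_ids[0, i]].item()
    return total

prompt = "Review: The plot dragged on forever.\nSentiment:"
candidates = [" positive", " negative"]
y_hat = max(candidates, key=lambda y: score(prompt, y))
print(y_hat)
```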

The takeaway from these different methods is that there's still a lot of work to be done on scoring functions that mitigate sensitivity and reduce bias. In a nutshell, it isn't easy to measure how well in-context learning is performing: the field is still very new, and standard metrics have not been established.

The paper includes a table of the factors that play into in-context learning; the findings below summarize it.
In the pretraining stage:
- domain source is more important than corpus size
- corpora related to downstream tasks don't necessarily improve ICL ability
- lower perplexity != better ICL
- ICL ability emerges only after enough pretraining steps and at sufficient model scale
During the inference stage:
- input-label formatting matters
- the label space exposed by the demonstrations matters
- the input distribution of the demonstrations matters
- the order of examples matters
- examples whose embeddings are close to the query's embedding help
The takeaway from numerous works attempting to understand why in-context learning works is:
- most studies so far are limited to simple models
- more work on gradient-based analyses could be promising

Benchmarks for ICL
Evaluating in-context learning performance is difficult.
- traditional evaluation tasks must be adapted to few-shot settings
- OpenICL is an initial attempt at evaluating ICL
Because of the tremendous success of in-context learning in NLP, it has also been applied to other fields like computer vision. SegGPT uses a GPT model that unifies a diverse set of segmentation tasks, and other work combines diffusion models with the in-context learning capabilities of LLMs. Flamingo and Kosmos-1 are examples of multimodal models leveraging LLMs. VALL-E X, an example of a speech LLM, demonstrates strong performance in cross-lingual text-to-speech synthesis and in speech-to-speech translation tasks.
The authors note that findings in in-context learning strategies for NLP cannot be directly transferred to other modalities.
Overall, ICL has great potential in data engineering (generating labels), model augmentation (retrievers), and knowledge updating (I've had to pass in-context documentation several times when using ChatGPT for bugs!).
We still have much to uncover about in-context learning, as the components involved (the examples, their number, their order, etc.) can swing performance anywhere from random guessing to SOTA. It is still a very new ability, and there's not yet a gold standard for quantifying in-context learning performance or understanding just how it works. There's also the matter of efficiency: ever-longer contexts in prompts only increase the time needed for inference.
Conclusion
This wraps up my summary on "A Survey on In-context Learning". I hope you enjoyed this, and don't forget to check out my other articles on prompt engineering and LLM surveys! Thanks! 👋
References
Prompt Engineering for LLMs: A Practical, Conceptual Guide
Exploring what actually works in prompt engineering
Surveying Survey Papers on the LLM Era
Surveying a practical survey guide on LLMs from ChatGPT onward and understanding the directions of LLM research!
A Survey of Large Language Models
In this article, I summarize the findings in the March 31, 2023 paper "A Survey of Large Language Models".