
Surveying Survey Papers on the LLM Era

Surveying practical survey papers on LLMs from ChatGPT onward and understanding the directions of LLM research!


Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond


I've already covered one survey paper on LLMs in this blog post, but I'll cover a couple more here! That being said, let's begin.

Practical Guide for Models

The above diagram (which is the staple of this paper) showcases the release times of encoder-only, encoder-decoder, and decoder-only transformer-based models. It's crazy to think that this much work has been built on the architecture introduced in the famous Attention Is All You Need paper. The authors note several interesting observations:
  • decoder-only models have been growing tremendously in popularity with encoder-only models fading away
  • OpenAI is dominant in the LLM industry
  • Meta is consistently open-sourcing their LLMs and LLM work
  • the introduction of GPT-3 led to more closed-source models, potentially for commercial use
  • encoder-decoder models may be promising
A table of popular LLMs for each of the three categories is listed below.

The authors describe 2 styles of LLMs:
  • BERT-style: Masked Language Modeling (MLM), where the LM is trained to fill in the blanks of a sentence in which certain words have been masked
  • GPT-style: autoregressive language modeling, where the LM is trained to predict the next token; this style boomed with GPT-3 (see the sketch below this list)
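To make the distinction concrete, here's a minimal sketch of both styles using Hugging Face pipelines; the library and model choices are my own illustrative assumptions, not something the paper prescribes:

```python
# Minimal sketch: BERT-style fill-in-the-blank vs. GPT-style next-token generation.
from transformers import pipeline

# BERT-style: masked language modeling (fill in the blank)
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("The capital of France is [MASK]."))

# GPT-style: autoregressive next-token prediction
generate = pipeline("text-generation", model="gpt2")
print(generate("The capital of France is", max_new_tokens=10))
```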

Practical Guide for Data

Throughout the paper, the authors include "remarks", which are essentially condensed findings.
Remark 1
  • LLMs > fine-tuned LMs on out-of-distribution data (the authors do not consider models under 20B parameters to be LLMs)
  • LLMs > fine-tuned LMs when annotated data is limited; when annotated data is plentiful, either can work depending on the objective and other factors at hand
  • advisable to pick models pretrained on similar data for downstream tasks (transfer learning-esque)
The authors go on to consider 3 scenarios for annotated data availability:
  • no annotated data: using an LLM zero-shot is most suitable
  • a few annotated examples: use an LLM with in-context learning, though techniques exist for finetuning smaller LMs (not LLMs) on minimal data; a prompt sketch follows below
  • abundant annotated data: finetuned LMs are viable, and LLMs are just as viable here (though each has pros and cons that aren't about data availability)
The authors remark that LLMs perform quite well with test/user data that are often out-of-distribution. What's a good way of summarizing the decision flow chart? Conveniently, the authors provided one!
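To illustrate the first two scenarios above, here's a minimal sketch of how a zero-shot prompt differs from a few-shot (in-context learning) prompt; the sentiment task and wording are my own illustrative choices, not examples from the paper:

```python
# Minimal sketch: zero-shot vs. few-shot (in-context learning) prompting.
def build_zero_shot_prompt(review: str) -> str:
    # No annotated data: just describe the task.
    return ("Classify the sentiment of this review as positive or negative.\n"
            f"Review: {review}\nSentiment:")

def build_few_shot_prompt(examples: list[tuple[str, str]], review: str) -> str:
    # A few annotated examples: prepend them as in-context demonstrations.
    demos = "\n".join(f"Review: {text}\nSentiment: {label}" for text, label in examples)
    return f"{demos}\nReview: {review}\nSentiment:"

examples = [("I loved every minute of it.", "positive"),
            ("A dull, predictable mess.", "negative")]
print(build_few_shot_prompt(examples, "The plot dragged, but the acting was superb."))
```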


Practical Guide for NLP Tasks

Remark 2
Finetuned language models generally perform better at traditional Natural Language Understanding (NLU) tasks like Named Entity Recognition (NER), though LLMs can help.
This is evidenced by:
  • sentiment analysis: finetuned LMs = LLMs on IMDB and SST
  • toxicity detection: finetuned LMs > LLMs on CivilComments
  • natural language inference (NLI): finetuned LMs > LLMs on RTE and SNLI; LLMs = finetuned LMs on CB
  • QA: finetuned LMs > LLMs on SQuADv2 and QuAC
  • NER: finetuned LMs >> LLMs on CoNLL03
In summary, finetuned LMs are generally better than LLMs on traditional benchmarks where annotated data is ample and the test/user data is not far from the training data. The authors suggest that the LLMs' struggles are in part due to how the prompts are structured.
However, there are still some NLP tasks that LLMs excel in: miscellaneous text classification (out-of-distribution prone) and adversarial NLI (LLMs possess stronger generalizability).
Natural Language Generation encompasses 2 categories:
  • converting input texts to new symbol sequences (task-specific generation like summarizing a paragraph; the model only generates summaries)
  • open-ended generation
Remark 3
Because LLMs are more generalizable and creative, they are dominant in natural language generation.
On summarization tasks XSUM and CNN/DailyMail, Brio and Pegasus perform better than LLMs w.r.t. the ROUGE metric, but human evaluation tends to still prefer LLM output to finetuned model outputs. This may also reflect the lack of exemplar summaries (exemplar w.r.t. human judgment) in the datasets.
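Since ROUGE comes up here, a quick aside: it's an n-gram overlap metric, which is part of why it can disagree with human judgments. Below is a minimal sketch of computing it, assuming the rouge-score package (my tooling choice, not necessarily what these benchmarks used):

```python
# Minimal ROUGE sketch: n-gram overlap between a reference and a candidate summary.
from rouge_score import rouge_scorer

reference = "The city council approved the new budget on Tuesday."
candidate = "On Tuesday the council approved a new city budget."

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
for name, score in scorer.score(reference, candidate).items():
    print(f"{name}: precision={score.precision:.2f} recall={score.recall:.2f} f1={score.fmeasure:.2f}")
```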
  • LLMs are competent in machine translation (MT) and their performance can be further improved by including more multilingual data in their pretraining
  • in rich-resource MT scenarios, finetuned LMs slightly beat LLMs
  • in low-resource MT scenarios, finetuned LMs significantly beat LLMs
Remark 4
  • LLMs are strong in knowledge-intensive tasks
  • they falter when the required knowledge conflicts with what they learned in pretraining, or when the task hinges on contextual knowledge
This is evidenced by:
  • closed-book QA: LLMs beat finetuned LMs on NaturalQuestions, WebQuestions, and TriviaQA
  • however, when a dataset works against the knowledge the model has (or the LLM simply hasn't learned that knowledge yet), the LLM falls behind finetuned LMs
  • an alternative is retrieval augmentation, which essentially gives the model an external "memory bank" to retrieve information from (a minimal sketch follows this list)
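Here's a minimal retrieval-augmentation sketch: embed documents, retrieve the most similar ones for a query, and prepend them to the prompt. The embedding model, toy documents, and prompt format are my own illustrative assumptions, not something prescribed by the survey:

```python
# Minimal retrieval-augmentation sketch: an external "memory bank" of documents
# is searched by embedding similarity, and the hits are prepended to the prompt.
import numpy as np
from sentence_transformers import SentenceTransformer

documents = [
    "The Eiffel Tower was completed in 1889.",
    "Mount Everest is the highest mountain above sea level.",
    "Python was created by Guido van Rossum.",
]
question = "When was the Eiffel Tower finished?"

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # SBERT-style encoder
doc_embs = encoder.encode(documents)               # the "memory bank"
q_emb = encoder.encode(question)

# Cosine similarity between the question and each document, then take the top 2.
sims = doc_embs @ q_emb / (np.linalg.norm(doc_embs, axis=1) * np.linalg.norm(q_emb))
top_k = np.argsort(-sims)[:2]

context = "\n".join(documents[i] for i in top_k)
prompt = f"Answer using the context below.\nContext:\n{context}\n\nQuestion: {question}\nAnswer:"
print(prompt)  # this prompt would then be sent to the LLM
```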
Remark 5
  • as models scale up, they grow more adept at commonsense and arithmetic reasoning
  • scaling also gives rise to emergent abilities
  • in some cases, though, scaling does not improve the model's performance
In arithmetic reasoning, it's been shown that two-digit addition emerges at around 13B parameters. On a handful of benchmarks (GSM8K, SVAMP, AQuA), LLMs show performance competitive with task-specific methods, and GPT-4 has even been shown to outclass all other methods. Chain-of-Thought (CoT) prompting also plays a role in improving performance; a sketch of what such a prompt looks like is below.
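Here's a minimal sketch of a chain-of-thought prompt: the in-context exemplar spells out intermediate reasoning steps rather than just the final answer, nudging the model to do the same (the exemplar wording is illustrative, not taken from the survey):

```python
# Minimal chain-of-thought prompt sketch: the exemplar includes its reasoning chain.
cot_prompt = """Q: Roger has 5 tennis balls. He buys 2 cans of 3 tennis balls each. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 balls is 6 balls. 5 + 6 = 11. The answer is 11.

Q: A farmer has 12 eggs, sells 7, then collects 4 more. How many eggs does the farmer have?
A:"""

print(cot_prompt)  # sent to an LLM, which is expected to continue with its own reasoning chain
```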
Emergent abilities are abilities that arise in these LLMs that aren't in pretrained language models. These abilities tend to be surprising and allow the model to perform a greater variety of tasks. The authors use the example of GPT-3 being able to unscramble a word or write the word from a reversed form of it.
The Inverse Scaling Phenomenon is where scaling the model up leads to a degradation in performance. The U-shaped Phenomenon is where performance first degrades as the model scales up and then improves again at even larger scales.
Remark 6
  • Finetuned models remain relevant for tasks that differ substantially from LLMs' pretraining objectives and data
  • LLMs can be used as quality evaluators for some NLP tasks
The next section covers "real-world" tasks, which may involve malformed data inputs, unclear task definitions, and failure to understand implicit intent (say the user describes what they want incorrectly or words it oddly).
Remark 7
  • LLMs >> finetuned LMs in real-world but evaluating LLMs is still an open problem
Techniques like alignment tuning and instruction tuning further boost LLMs' prowess in the real world.
Remark 8
  • if cost and strict latency are issues, finetuned models with PEFT (parameter-efficient finetuning) are better
  • LLMs still exhibit some shortcut-learning issues on task-specific datasets
  • safety concerns in LLMs are a top priority as their output is open-ended
Some statistics:
  • GPT-1 sits at 117M parameters, GPT-2 at 1.5B, and GPT-3 at 175B parameters
  • training an 11B-parameter T5 model cost about $1.3M for a single run
  • a single training run of the 175B-parameter GPT-3 cost about $4.6M
  • PaLM consumed 3.4 GWh in about 2 months
  • OpenAI partnered with Microsoft and Azure, leveraging their immense supercomputer of 285k CPU cores and 10k high-end GPUs
  • LLMs are too large to be on a single computer so they are provided as services via APIs
  • PEFT can be used to adapt a model at a substantially smaller cost; common methods include Low-Rank Adaptation (LoRA), Prefix Tuning, and P-Tuning (a minimal LoRA sketch follows this list)
  • accuracy and robustness are correlated in LLMs but finetuning an LLM on task-specific data may lead to overparameterization and overfitting
  • LLMs perpetuate societal biases
  • LLMs are vulnerable to label bias and shortcut learning even in zero-shot scenarios
  • LLMs can generate harmful content, whether prompted to or not, and they can hallucinate, i.e., generate false information
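To give a feel for how PEFT keeps costs down, here's a minimal from-scratch sketch of a LoRA layer in PyTorch; it's an illustration of the idea (freeze the pretrained weight, train only two small low-rank matrices), not the official implementation from the LoRA paper or the peft library:

```python
# Minimal LoRA sketch: y = W x + (B A) x * scaling, with W frozen and only A, B trained.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features: int, out_features: int, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        self.base.weight.requires_grad_(False)  # freeze the pretrained weight
        self.base.bias.requires_grad_(False)
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))  # zero-init, so LoRA starts as a no-op
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

layer = LoRALinear(768, 768)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable params: {trainable} / {total}")  # only A and B are trainable
```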
In conclusion, the authors list a few areas of interest:
  • evaluating models on both benchmarks and real-world datasets
  • model and safety alignment problem with LLMs
  • predicting performance scaling

A Bibliometric Review of Large Language Models Research from 2017 to 2023

Since this is the second paper in this survey of surveys, I'll try my best not to repeat information already present in the first paper. Instead, I'll highlight the interesting results. Let's begin!

Their workflow for sourcing publications is shown above. They used the Web of Science (WoS) Core Collection to retrieve publications based on article titles and topics with the following query.

From their query, they obtained about 5.7k publications between 2017 and 2023. Then they used BERTopic (a sketch is shown below this list):
  • encoding titles and abstracts as embedding vectors using SBERT
  • reducing the dimensionality with UMAP
  • clustering with k-Means
  • Bag of Words (BoW) for counting frequency
  • c-TF-IDF (class-based TF-IDF) to extract the distinguishing topical keywords of each cluster
They did a whole lot more than what I just listed and, frankly, I didn't understand all of it! But the results they generated, I do understand!
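To make that pipeline concrete, here's a minimal sketch of how it might look with the BERTopic library, wiring in an SBERT encoder, UMAP, k-Means, and a Bag-of-Words vectorizer. The dataset and parameter values below are placeholders so the sketch runs end to end, not the authors' actual corpus or settings (they clustered ~5.7k WoS titles/abstracts into ~200 topics):

```python
# Minimal BERTopic sketch mirroring the pipeline above: SBERT embeddings -> UMAP ->
# k-Means clustering -> Bag-of-Words counts, with c-TF-IDF applied on top by BERTopic.
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from umap import UMAP
from sklearn.cluster import KMeans
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

# Stand-in corpus; the paper used ~5.7k WoS titles + abstracts.
docs = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes")).data[:2000]

topic_model = BERTopic(
    embedding_model=SentenceTransformer("all-MiniLM-L6-v2"),  # SBERT embeddings
    umap_model=UMAP(n_components=5, random_state=42),         # dimensionality reduction
    hdbscan_model=KMeans(n_clusters=20, random_state=42),     # k-Means clustering (the paper used ~200 topics)
    vectorizer_model=CountVectorizer(stop_words="english"),   # Bag-of-Words counts
)
topics, _ = topic_model.fit_transform(docs)
print(topic_model.get_topic_info().head())
```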
They narrowed down to 200 topics categorized into 5 buckets:
  • Algorithm and NLP tasks
  • Medical and Engineering Applications
  • Social and Humanitarian Applications
  • Critical Studies
  • Infrastructures
Publication frequency has exploded since 2018, and especially so since ChatGPT's release. Most publications concern algorithms and NLP tasks.
There are no obviously distinct clusters. The infrastructure bucket focuses on distributed computing, hardware, and accelerators.
Panel (a) shows general NLP concepts and tasks, (b) seems to focus on pretraining and NLP techniques, and (c) focuses on topics like fake news and hate speech.
Most papers are published in North America, Asia-Pacific, and Europe. The authors found that the USA and China are at the forefront of research on algorithms and NLP tasks.
Universities play a large role in LLM research but large tech companies have grown to become significant contributors in their own right.
This paper had some interesting findings! Most survey papers cover the technical aspects of research, but this one identifies bigger-picture trends: the research field as a whole, its general direction, and its presence in the real world.

Conclusion

Well, this concludes my breakdown of 2 survey papers. One covers the research of LLMs and the other identifies trends in LLM research over the past 6 years from 2017 to 2023! I hope you found this insightful. Thank you for reading! 👋

References

Also check out this blog post where I cover another survey paper!
Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond
A Bibliometric Review of Large Language Models Research from 2017 to 2023
