
Surveying Survey Papers on the LLM Era

Surveying practical survey papers on LLMs from ChatGPT onward and understanding the directions of LLM research!


Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond


I've already covered one survey paper on LLMs in this blog post, but I'll cover a couple more here! That being said, let's begin.

Practical Guide for Models

The above diagram (which is the staple of this paper) showcases the release times of encoder-only, encoder-decoder, and decoder-only transformer-based models. It's crazy to think that this much work has been built on the architecture introduced in the famous Attention Is All You Need paper. The authors note several interesting observations:
  • decoder-only models have been growing tremendously in popularity with encoder-only models fading away
  • OpenAI is dominant in the LLM industry
  • Meta is consistently open-sourcing their LLMs and LLM work
  • the introduction of GPT-3 led to more closed-source models, potentially for commercial use
  • encoder-decoder models may be promising
A table of popular LLMs for each of the three categories is listed below.

The authors describe 2 styles of LLMs:
  • BERT-style: Masked Language Modeling (MLM), where the LM is trained to fill in the blanks of a sentence in which certain words have been masked
  • GPT-style: autoregressive language modeling, where the LM is trained to predict the next token; this style boomed with GPT-3 (see the sketch below this list)
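To make the distinction concrete, here's a minimal sketch of both styles using Hugging Face pipelines; the library and model choices are my own illustrative assumptions, not something the paper prescribes:

```python
# Minimal sketch: BERT-style fill-in-the-blank vs. GPT-style next-token generation.
from transformers import pipeline

# BERT-style: masked language modeling (fill in the blank)
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("The capital of France is [MASK]."))

# GPT-style: autoregressive next-token prediction
generate = pipeline("text-generation", model="gpt2")
print(generate("The capital of France is", max_new_tokens=10))
```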

Practical Guide for Data

Throughout the paper, the authors include "remarks", which are essentially condensed findings.
Remark 1
  • LLMs > fine-tuned LMs on out-of-distribution data (the authors do not consider models under 20B parameters to be LLMs)
  • LLMs > fine-tuned LMs when annotated data is limited; when annotated data is plentiful, either can work depending on the objective and other factors at hand
  • advisable to pick models pretrained on similar data for downstream tasks (transfer learning-esque)
The authors go on to consider 3 scenarios for annotated data availability:
  • no annotated data: using an LLM zero-shot is most suitable
  • a few annotated examples: use an LLM with in-context learning, though techniques exist for finetuning smaller LMs (not LLMs) on minimal data; a prompt sketch follows below
  • abundant annotated data: finetuned LMs are viable, and LLMs are just as viable here (though each has pros and cons that aren't about data availability)
The authors remark that LLMs perform quite well with test/user data that are often out-of-distribution. What's a good way of summarizing the decision flow chart? Conveniently, the authors provided one!
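To illustrate the first two scenarios above, here's a minimal sketch of how a zero-shot prompt differs from a few-shot (in-context learning) prompt; the sentiment task and wording are my own illustrative choices, not examples from the paper:

```python
# Minimal sketch: zero-shot vs. few-shot (in-context learning) prompting.
def build_zero_shot_prompt(review: str) -> str:
    # No annotated data: just describe the task.
    return ("Classify the sentiment of this review as positive or negative.\n"
            f"Review: {review}\nSentiment:")

def build_few_shot_prompt(examples: list[tuple[str, str]], review: str) -> str:
    # A few annotated examples: prepend them as in-context demonstrations.
    demos = "\n".join(f"Review: {text}\nSentiment: {label}" for text, label in examples)
    return f"{demos}\nReview: {review}\nSentiment:"

examples = [("I loved every minute of it.", "positive"),
            ("A dull, predictable mess.", "negative")]
print(build_few_shot_prompt(examples, "The plot dragged, but the acting was superb."))
```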


Practical Guide for NLP Tasks

Remark 2
Finetuned language models generally perform better at traditional Natural Language Understanding (NLU) tasks like Named Entity Recognition (NER), though LLMs can help.
This is evidenced by:
  • sentiment analysis: finetuned LMs = LLMs on IMDB and SST
  • toxicity detection: finetuned LMs > LLMs on CivilComments
  • natural language inference (NLI): finetuned LMs > LLMs on RTE and SNLI; LLMs = finetuned LMs on CB
  • QA: finetuned LMs > LLMs on SQuADv2 and QuAC
  • NER: finetuned LMs >> LLMs on CoNLL03
In summary, finetuned LMs are generally better than LLMs on traditional benchmarks where annotated data is ample and the test/user data is not far from the training data. The authors suggest that the LLMs' struggles are in part due to how the prompts are structured.
However, there are still some NLP tasks that LLMs excel in: miscellaneous text classification (out-of-distribution prone) and adversarial NLI (LLMs possess stronger generalizability).
Natural Language Generation encompasses 2 categories:
  • converting input texts to new symbol sequences (task-specific generation like summarizing a paragraph; the model only generates summaries)
  • open-ended generation
Remark 3
Because LLMs are more generalizable and creative, they are dominant in natural language generation.
On summarization tasks XSUM and CNN/DailyMail, Brio and Pegasus perform better than LLMs w.r.t. the ROUGE metric, but human evaluation tends to still prefer LLM output to finetuned model outputs. This may also reflect the lack of exemplar summaries (exemplar w.r.t. human judgment) in the datasets.
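Since ROUGE comes up here, a quick aside: it's an n-gram overlap metric, which is part of why it can disagree with human judgments. Below is a minimal sketch of computing it, assuming the rouge-score package (my tooling choice, not necessarily what these benchmarks used):

```python
# Minimal ROUGE sketch: n-gram overlap between a reference and a candidate summary.
from rouge_score import rouge_scorer

reference = "The city council approved the new budget on Tuesday."
candidate = "On Tuesday the council approved a new city budget."

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
for name, score in scorer.score(reference, candidate).items():
    print(f"{name}: precision={score.precision:.2f} recall={score.recall:.2f} f1={score.fmeasure:.2f}")
```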
  • LLMs are competent in machine translation (MT) and their performance can be further improved by including more multilingual data in their pretraining
  • in rich-resource MT scenarios, finetuned LMs slightly beat LLMs
  • in low-resource MT scenarios, finetuned LMs significantly beat LLMs
Remark 4
  • LLMs are strong in knowledge-intensive tasks
  • they falter when the required knowledge conflicts with what they learned in pretraining, or when the task hinges on contextual knowledge
This is evidenced by:
  • closed-book QA: LLMs beat finetuned LMs on NaturalQuestions, WebQuestions, and TriviaQA
  • however, when a dataset works against the knowledge the model has (or the LLM simply hasn't learned that knowledge yet), the LLM falls behind finetuned LMs
  • an alternative is retrieval augmentation, which essentially gives the model an external "memory bank" to retrieve information from (a minimal sketch follows this list)
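Here's a minimal retrieval-augmentation sketch: embed documents, retrieve the most similar ones for a query, and prepend them to the prompt. The embedding model, toy documents, and prompt format are my own illustrative assumptions, not something prescribed by the survey:

```python
# Minimal retrieval-augmentation sketch: an external "memory bank" of documents
# is searched by embedding similarity, and the hits are prepended to the prompt.
import numpy as np
from sentence_transformers import SentenceTransformer

documents = [
    "The Eiffel Tower was completed in 1889.",
    "Mount Everest is the highest mountain above sea level.",
    "Python was created by Guido van Rossum.",
]
question = "When was the Eiffel Tower finished?"

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # SBERT-style encoder
doc_embs = encoder.encode(documents)               # the "memory bank"
q_emb = encoder.encode(question)

# Cosine similarity between the question and each document, then take the top 2.
sims = doc_embs @ q_emb / (np.linalg.norm(doc_embs, axis=1) * np.linalg.norm(q_emb))
top_k = np.argsort(-sims)[:2]

context = "\n".join(documents[i] for i in top_k)
prompt = f"Answer using the context below.\nContext:\n{context}\n\nQuestion: {question}\nAnswer:"
print(prompt)  # this prompt would then be sent to the LLM
```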
Remark 5
  • as models scale up, they grow more adept at commonsense and arithmetic reasoning
  • scaling also gives rise to emergent abilities
  • in some cases, though, scaling does not improve the model's performance
In arithmetic reasoning, it's been shown that two-digit addition emerges at around 13B parameters. On a handful of benchmarks (GSM8K, SVAMP, AQuA), LLMs show performance competitive with task-specific methods, and GPT-4 has even been shown to outclass all other methods. Chain-of-Thought (CoT) prompting also plays a role in improving performance; a sketch of what such a prompt looks like is below.
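Here's a minimal sketch of a chain-of-thought prompt: the in-context exemplar spells out intermediate reasoning steps rather than just the final answer, nudging the model to do the same (the exemplar wording is illustrative, not taken from the survey):

```python
# Minimal chain-of-thought prompt sketch: the exemplar includes its reasoning chain.
cot_prompt = """Q: Roger has 5 tennis balls. He buys 2 cans of 3 tennis balls each. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 balls is 6 balls. 5 + 6 = 11. The answer is 11.

Q: A farmer has 12 eggs, sells 7, then collects 4 more. How many eggs does the farmer have?
A:"""

print(cot_prompt)  # sent to an LLM, which is expected to continue with its own reasoning chain
```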
Emergent abilities are abilities that arise in these LLMs that aren't in pretrained language models. These abilities tend to be surprising and allow the model to perform a greater variety of tasks. The authors use the example of GPT-3 being able to unscramble a word or write the word from a reversed form of it.
The Inverse Scaling Phenomenon is where scaling the model up leads to a degradation in performance. The U-shaped Phenomenon is where performance first degrades as the model scales up and then improves again at even larger scales.
Remark 6
  • Finetuned models remain relevant for tasks that differ substantially from LLMs' pretraining objectives and data
  • LLMs can be used as quality evaluators for some NLP tasks
The next section covers "real-world" tasks, which may involve malformed data inputs, unclear task definitions, and failure to understand implicit intent (say the user describes what they want incorrectly or words it oddly).
Remark 7
  • LLMs >> finetuned LMs in real-world but evaluating LLMs is still an open problem
Techniques like alignment tuning and instruction tuning further boost LLMs' prowess in the real world.
Remark 8
  • if cost and strict latency are issues, finetuned models with PEFT (parameter-efficient finetuning) are better
  • LLMs still exhibit some shortcut-learning issues on task-specific datasets
  • safety concerns in LLMs are a top priority as their output is open-ended
Some statistics:
  • GPT-1 sits at 117M parameters, GPT-2 at 1.5B, and GPT-3 at 175B parameters
  • training an 11B-parameter T5 model cost about $1.3M for a single run
  • a single training run of the 175B-parameter GPT-3 cost about $4.6M
  • PaLM consumed 3.4 GWh in about 2 months
  • OpenAI partnered with Microsoft and Azure, leveraging their immense supercomputer of 285k CPU cores and 10k high-end GPUs
  • LLMs are too large to be on a single computer so they are provided as services via APIs
  • PEFT can be used to adapt a model at a substantially smaller cost; common methods include Low-Rank Adaptation (LoRA), Prefix Tuning, and P-Tuning (a minimal LoRA sketch follows this list)
  • accuracy and robustness are correlated in LLMs but finetuning an LLM on task-specific data may lead to overparameterization and overfitting
  • LLMs perpetuate societal biases
  • LLMs are vulnerable to label bias and shortcut learning even in zero-shot scenarios
  • LLMs can generate harmful content, whether prompted to or not, and they can hallucinate, i.e., generate false information
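To give a feel for how PEFT keeps costs down, here's a minimal from-scratch sketch of a LoRA layer in PyTorch; it's an illustration of the idea (freeze the pretrained weight, train only two small low-rank matrices), not the official implementation from the LoRA paper or the peft library:

```python
# Minimal LoRA sketch: y = W x + (B A) x * scaling, with W frozen and only A, B trained.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features: int, out_features: int, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        self.base.weight.requires_grad_(False)  # freeze the pretrained weight
        self.base.bias.requires_grad_(False)
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))  # zero-init, so LoRA starts as a no-op
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

layer = LoRALinear(768, 768)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable params: {trainable} / {total}")  # only A and B are trainable
```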
In conclusion, the authors list a few areas of interest:
  • evaluating models on both benchmarks and real-world datasets
  • model and safety alignment problem with LLMs
  • predicting performance scaling

A Bibliometric Review of Large Language Models Research from 2017 to 2023

Since this is the second paper in this survey of surveys, I'll try my best not to repeat information already present in the first paper. Instead, I'll highlight the interesting results. Let's begin!

Their workflow for sourcing publications is shown above. They used the Web of Science (WoS) Core Collection to retrieve publications based on article titles and topics with the following query.

From their query, they obtained about 5.7k publications between 2017 and 2023. Then they used BERTopic (a sketch is shown below this list):
  • encoding titles and abstracts as embedding vectors using SBERT
  • reducing the dimensionality with UMAP
  • clustering with k-Means
  • Bag of Words (BoW) for counting frequency
  • c-TF-IDF (class-based TF-IDF) to extract the distinguishing topical keywords of each cluster
They did a whole lot more than what I just listed and, frankly, I didn't understand all of it! But the results they generated, I do understand!
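To make that pipeline concrete, here's a minimal sketch of how it might look with the BERTopic library, wiring in an SBERT encoder, UMAP, k-Means, and a Bag-of-Words vectorizer. The dataset and parameter values below are placeholders so the sketch runs end to end, not the authors' actual corpus or settings (they clustered ~5.7k WoS titles/abstracts into ~200 topics):

```python
# Minimal BERTopic sketch mirroring the pipeline above: SBERT embeddings -> UMAP ->
# k-Means clustering -> Bag-of-Words counts, with c-TF-IDF applied on top by BERTopic.
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from umap import UMAP
from sklearn.cluster import KMeans
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

# Stand-in corpus; the paper used ~5.7k WoS titles + abstracts.
docs = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes")).data[:2000]

topic_model = BERTopic(
    embedding_model=SentenceTransformer("all-MiniLM-L6-v2"),  # SBERT embeddings
    umap_model=UMAP(n_components=5, random_state=42),         # dimensionality reduction
    hdbscan_model=KMeans(n_clusters=20, random_state=42),     # k-Means clustering (the paper used ~200 topics)
    vectorizer_model=CountVectorizer(stop_words="english"),   # Bag-of-Words counts
)
topics, _ = topic_model.fit_transform(docs)
print(topic_model.get_topic_info().head())
```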
They narrowed down to 200 topics categorized into 5 buckets:
  • Algorithm and NLP tasks
  • Medical and Engineering Applications
  • Social and Humanitarian Applications
  • Critical Studies
  • Infrastructures
Publication frequency has exploded since 2018, and especially so since ChatGPT's release. Most publications concern algorithms and NLP tasks.
There are no obviously distinct clusters. The infrastructure bucket focuses on distributed computing, hardware, and accelerators.
Panel (a) shows general NLP concepts and tasks, (b) seems to focus on pretraining and NLP techniques, and (c) focuses on topics like fake news and hate speech.
Most papers are published in North America, Asia-Pacific, and Europe. The authors found that the USA and China are at the forefront of research on algorithms and NLP tasks.
Universities play a large role in LLM research but large tech companies have grown to become significant contributors in their own right.
This paper had some interesting findings! Most survey papers cover the technical aspects of research, but this one identifies bigger-picture trends: the research field as a whole, its general direction, and its presence in the real world.

Conclusion

Well, this concludes my breakdown of 2 survey papers. One covers the research of LLMs and the other identifies trends in LLM research over the past 6 years from 2017 to 2023! I hope you found this insightful. Thank you for reading! 👋

References

Also check out this blog post where I cover another survey paper!
Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond
A Bibliometric Review of Large Language Models Research from 2017 to 2023
