
Evaluating Large Language Models (LLMs) with Eleuther AI

In this article, we explore the need for tools to evaluate auto-regressive language models, dig into the most commonly used metrics, and run an evaluation with lm-eval ourselves.
In this article, we'll look at how to evaluate large language models (LLMs) with the help of the lm-eval package from the folks at Eleuther AI. We'll begin with why this is necessary, then dig into the evaluation metrics commonly used in the field before getting our hands dirty with lm-eval.
Here's what we'll be covering:

Table of Contents
  • Introduction to LLM Evaluation
  • Evaluation Metrics
  • What is LM-Eval?
  • Installation and Use of LM-Eval
  • Development and Customization
  • Task Versioning
  • Task Description
  • Decontamination
  • Human Evaluation
  • Conclusion

Let's get to it.

Introduction to LLM Evaluation

Recent advances in NLP research, such as the introduction of Transformer models, have undoubtedly contributed to significant progress in a wide range of language-related tasks.
Few studies, however, investigate the robustness and reproducibility of their evaluation methodologies. As the size and capabilities of Large Language Models (LLMs) have grown dramatically in recent years, so has concern about how these models, and the data they are trained and evaluated on, should be assessed.
A rigorous evaluation process helps practitioners understand the capabilities and limitations of the models they develop. However, because LLMs are frequently trained on massive amounts of data and can perform various tasks in zero-shot, one-shot, and few-shot settings, evaluating them presents significant challenges. For example, the measured performance of an LLM is extremely sensitive to the exact prompt used. Additionally, many assessment benchmarks lack clear design standards, making it easy for an evaluation to be implemented in ways the benchmark authors did not intend.
Training data leakage in particular complicates systematic analysis and reproducible evaluation, because a model's measured performance depends not only on its architecture but also on the data it was trained on. In practice, different preprocessing techniques applied before evaluation, as well as different versions of an evaluation dataset, create reproducibility issues, prompting practitioners to copy published evaluation results rather than rerun them.
In other words, the evaluation of these large models is fairly challenging. Let's move on to how that's typically done.

Evaluation Metrics

Traditionally, language model performance is measured with intrinsic metrics such as perplexity, cross-entropy, and bits-per-character (BPC). Large Language Models, however, have unlocked abilities these metrics do not capture, such as arithmetic, few-shot learning, and multi-step reasoning. At the same time, LLMs are not without flaws: they exhibit biases and produce plausible-sounding misinformation.
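As a quick refresher, the three intrinsic metrics are closely related: cross-entropy is the average negative log-likelihood per token, perplexity is its exponential, and BPC rescales the total log-likelihood by the character count and converts nats to bits. Here is a small sketch with made-up numbers, purely to show how they relate:
import math

# Made-up per-token negative log-likelihoods (in nats) and character count,
# used only to illustrate how the three metrics relate.
nlls = [2.1, 1.7, 2.4, 1.9]   # one value per token
num_chars = 23                # characters in the scored text

cross_entropy = sum(nlls) / len(nlls)          # nats per token
perplexity = math.exp(cross_entropy)           # e^H
bpc = sum(nlls) / (num_chars * math.log(2))    # bits per character

print(f"H = {cross_entropy:.3f} nats/token, ppl = {perplexity:.2f}, bpc = {bpc:.3f}")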
While these intrinsic metrics are extremely useful while training the language model itself, they do not provide a good way to compare and benchmark LLMs on the behaviors we actually care about. As language models get bigger and are used in more real-world applications, it is important to verify that they are not getting worse or harming users in yet-undetected ways.
Recent benchmarks address these issues by testing LLMs for logical and common-sense reasoning, dataset-specific bias, the ability to keep track of information, and performance on downstream tasks without task-specific gradient optimization. A few examples of such benchmarks are CoQA, LAMBADA, HellaSwag, and LogiQA. These benchmarks provide methods for evaluating LLMs for mismatches between the behavior we want them to exhibit and the behavior we observe in practice as a result of the training objectives and data we use.
The tools needed to assess LLMs must provide a rigorous, robust, and reproducible framework for practitioners to replicate and investigate published state-of-the-art results. The framework must also provide a mechanism for normalizing and decontaminating the training data by removing biased, extraneous, and unwanted elements. Finally, the framework should be capable of detecting and reporting different evaluation versions and preventing data leakage into validation and test datasets.
Enter lm-eval.

What is LM-Eval?

The lm-eval Python package, released by EleutherAI, a grassroots community of researchers working to enable open-source AI research, provides one such framework for LLM evaluation. With a flexible and tokenization-agnostic interface, the library provides a single framework for evaluating and reporting auto-regressive language models on various Natural Language Understanding (NLU) tasks. There are currently over 200 evaluation tasks, supporting models such as GPT-2, T5, GPT-J, GPT-Neo, GPT-NeoX, and Flan-T5.

Installation and Use of LM-Eval

The evaluation harness is installable via pip. To install the package, simply run pip install lm-eval. Once installed, you can use the simple command-line interface to evaluate models in zero-shot, one-shot, and few-shot settings.
A basic evaluation can be run with the following command:
# --model: a Hugging Face model type
# --device: the device to use for evaluation
# --tasks: a comma-separated list of tasks to evaluate
python main.py \
    --model gpt2 \
    --device 0 \
    --tasks lambada,hellaswag
This evaluates the GPT-2 (117M) model from the Hugging Face Hub by default. Other arguments that can be passed to the script are listed below, with a fuller example command after the list:
  • model: Name of the model type as registered in the library (e.g. gpt2, gpt3)
  • model_args: Arguments to pass to the Hugging Face AutoModel for initialization
  • tasks: A comma-separated list of tasks to evaluate the model on
  • num_fewshot: Number of examples to include in the few-shot context passed to the model
  • batch_size: Number of examples in a single evaluation batch
  • device: The device to run the evaluation on (e.g. "cpu" or "cuda:0")
  • no_cache: Whether or not to cache the model results during evaluation
  • description_dict: A dictionary of custom task descriptions of the form task_name: description
  • decontamination_ngrams_path: Path to a directory containing n-gram duplicates used to decontaminate the test set
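Putting several of these together, a five-shot run might look like the following. Note that pretrained=gpt2-medium is shown only as an illustration of the model_args syntax; check the project README for the exact arguments your installed version accepts.
python main.py \
    --model gpt2 \
    --model_args pretrained=gpt2-medium \
    --tasks lambada,hellaswag \
    --num_fewshot 5 \
    --batch_size 8 \
    --device cuda:0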
The script can also be used to evaluate OpenAI's GPT-3 models. Simply set the OPENAI_API_KEY environment variable and run the script with the argument --model gpt3.
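For example, something like the following should work; the engine used by the gpt3 model can typically be selected via model_args, so treat the exact arguments here as illustrative.
export OPENAI_API_KEY=<your API key>
python main.py \
    --model gpt3 \
    --tasks lambada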

Development and Customization

The package also allows task development and customization. This promotes open contribution and already has over 200 tasks listed! The library is organized in a simple and intuitive structure and consists of two main components:
  • Models - e.g. GPT-2, GPT-3, T5
  • Tasks - e.g. MNLI, SQuAD, LAMBADA
Both components are stored in a registry data structure for easy initialization via the command line. Let's look at both of these in further detail:

Models

The sub-package lm_eval.models includes the models supported by the library. Currently, the following models are registered in the package:
MODEL_REGISTRY = {
    "hf": gpt2.HFLM,                     # Hugging Face-compatible causal language models
    "gpt2": gpt2.GPT2LM,                 # Hugging Face GPT-2 models and variants
    "gpt3": gpt3.GPT3LM,                 # OpenAI GPT-3 models - requires an OpenAI API key
    "textsynth": textsynth.TextSynthLM,  # models served by the textsynth.com API
    "dummy": dummy.DummyLM,              # base implementation of the model API - can be subclassed for custom models
}
To implement a new model, simply subclass the BaseLM class from lm_eval.base and override the properties and methods according to your custom model's implementation. Finally, register your subclass in the MODEL_REGISTRY above so the CLI script can initialize the model.
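Here is a minimal sketch of that pattern. It assumes the request-level methods loglikelihood, loglikelihood_rolling, and greedy_until from the abstract LM interface in lm_eval.base; the exact set of properties and methods to override (BaseLM also exposes lower-level tokenization hooks) should be checked against the library source, and MyCustomLM is, of course, a placeholder.
from lm_eval.base import LM
from lm_eval.models import MODEL_REGISTRY


class MyCustomLM(LM):
    """Placeholder model wrapper; replace the bodies with calls into your model."""

    def loglikelihood(self, requests):
        # requests: list of (context, continuation) pairs.
        # Return one (log_prob, is_greedy) tuple per request.
        return [(0.0, True) for _ in requests]

    def loglikelihood_rolling(self, requests):
        # Full-sequence log-likelihoods, used for perplexity-style tasks.
        return [0.0 for _ in requests]

    def greedy_until(self, requests):
        # Free-form generation until a stop sequence, used for generation tasks.
        return ["" for _ in requests]


# Make the model available to the CLI as --model my-custom-lm
# (in practice, add an entry to the registry dict in lm_eval/models/__init__.py).
MODEL_REGISTRY["my-custom-lm"] = MyCustomLM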

Tasks

The Task class is the foundation of all natural language tasks in the lm-eval package. It contains everything you’d need to perform a few-shot evaluation of any autoregressive language model. Although there are over 200 pre-implemented tasks, the library also provides a step-by-step guide to subclass the Task class for evaluation of new or custom tasks.
More concretely, we can create new tasks using the task templates provided by the library. Depending on the type of task you are trying to evaluate, you can either create a new task or a new multiple-choice task. The templates come with #TODO instructions that help set up the custom Task. In short, you are required to add code for the following steps, condensed into the sketch after this list:
  1. Downloading the task data via the Hugging Face datasets library
  2. Processing documents for the task, e.g. whitespace removal or de-tokenization
  3. Formatting examples for few-shot evaluation
  4. Decontaminating to remove any overlap with training data
  5. Registering the task in the task registry
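The sketch below condenses those steps, following the structure of the task template shipped with the harness at the time of writing. Names such as DATASET_PATH, doc_to_text, should_decontaminate, and TASK_REGISTRY come from that template and should be verified against the current code; my_org/my_dataset and the field names are placeholders, and a real task also needs the request-construction and metric methods the template walks you through.
from lm_eval.base import Task


class MyCustomTask(Task):
    VERSION = 0
    DATASET_PATH = "my_org/my_dataset"  # 1. downloaded via the Hugging Face datasets library
    DATASET_NAME = None

    def has_training_docs(self):
        return True

    def has_validation_docs(self):
        return True

    def has_test_docs(self):
        return False

    def training_docs(self):
        return map(self._process_doc, self.dataset["train"])

    def validation_docs(self):
        return map(self._process_doc, self.dataset["validation"])

    def _process_doc(self, doc):
        # 2. light per-document processing, e.g. stripping whitespace
        return {"query": doc["question"].strip(), "answer": doc["answer"].strip()}

    def doc_to_text(self, doc):
        # 3. how a document is rendered inside the few-shot prompt
        return f"Question: {doc['query']}\nAnswer:"

    def doc_to_target(self, doc):
        return " " + doc["answer"]

    def should_decontaminate(self):
        # 4. opt in to the n-gram overlap check described in the Decontamination section
        return True

    def doc_to_decontamination_query(self, doc):
        return doc["query"]


# 5. Finally, register the task, e.g. by adding it to TASK_REGISTRY in lm_eval/tasks/__init__.py:
# TASK_REGISTRY["my-custom-task"] = MyCustomTask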


Task Versioning

To help improve reproducibility, all tasks have a VERSION field. When run from the command line, this is reported in a column in the table, or in the "version" field in the evaluator return dict.
The purpose of the version is that, if the task definition changes (e.g. to fix a bug), you can know exactly which metrics were computed using the old, buggy implementation and avoid unfair comparisons. Task versions start at 0, and each time a breaking change is made, the version is incremented by one. When reporting results, it's recommended to report the version of each task as well.

Task Description

Recent language models can often perform instruction-based few-shot inference, where, in addition to the examples in the few-shot prompt, a description of the task is also provided to the model.
For example, in machine translation, the instruction looks something like Translate from English to German. This can be provided as a description dictionary to the evaluation script on a per-task basis. The authors provide a useful guide to setting up task descriptions for evaluation.
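A description dictionary might look something like this; the task names are illustrative, and the exact way the dictionary is passed to the script (inline or as a JSON file) is covered in the guide linked above.
description_dict = {
    "wmt16-en-de": "Translate English to German.",
    "hellaswag": "Choose the most plausible continuation of the passage.",
}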

Decontamination

During evaluation, we are often concerned with a model's generalization capabilities. However, since most LLMs are trained on very large internet datasets such as the Pile, it's likely that a task's test set is present in the model's training data; this is known as data leakage or contamination. To alleviate this issue, the package also provides a decontamination tool that follows the methodology defined in OpenAI's "Language Models are Few-Shot Learners": a test document is considered contaminated if it shares any N-gram with a training document, and for simplicity the library fixes N at 13. This provides a useful way to measure the impact of test set leakage by detecting the contaminated test examples and producing a clean version of the benchmark.
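Assuming you have already generated the 13-gram files for your training corpus (the repository documents how to produce these), a decontaminated run is just the usual command plus the flag listed earlier. path/to/ngrams below is a placeholder, and the flag only has an effect for tasks that implement decontamination.
python main.py \
    --model gpt2 \
    --tasks lambada \
    --decontamination_ngrams_path path/to/ngrams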

Human Evaluation

While automatic evaluation provides a way to evaluate, compare and measure the performance of language models, most benchmarks don’t evaluate abilities such as creativity, humor and engagingness.
As language models become a core component of many practical applications, it's also important to measure their performance using human evaluators. For instance, here's a post that discusses 7 categories where LLMs have concrete applications and how they perform according to human evaluators. Such evaluation not only highlights the limitations of today's language models but also points out the importance of paying attention to evaluation datasets.

Conclusion

With Large Language Models (LLMs) becoming a key component of almost all new NLP and NLU pipelines, it's imperative for practitioners to perform and report evaluations in a robust and standardized way. The lm-eval package provides the essential tools to carry out and report evaluations in a standard format. This is instrumental not only in fairly evaluating progress but also in extending the development of LLMs in areas where they are not yet performant.
