Evaluating Generative Models with GPTScore
GPTScore is a dynamic, multi-faceted approach to evaluating generative models across a variety of tasks and aspects.
Background
This paper introduces a metric to measure the performance of a generative NLP model on a variety of ML tasks. Here's my high-level understanding of this paper.
Creating metrics that measure how well generative models perform on more abstract aspects (think bias, fairness, relevance, etc.) is difficult. Existing evaluation methods fall into two main categories: manual and automated. Some of these methods are complex and require a lot of overhead. Automated metrics, moreover, can't be trusted on their own, so they are also checked with meta-evaluations, which measure how well the automated metric aligns with human/manual judgments.
The general mathematical formulation for these types of abstract generative model metrics is:

$$y = f(h, a, S)$$

Where,
- y = the metric value
- f = a function, manual or automated, that evaluates the given text on a specific aspect/criterion, given some context information, and outputs a metric value
- h = text you want to evaluate
- a = some aspect or criterion you want to evaluate it on (e.g. evaluate this piece of text for how relevant it is)
- S = some context information, which can be the source (the original document h was generated from) or a reference text (e.g., a gold summary of that source document)
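To make the formulation concrete, here's a minimal sketch of that interface in Python. The evaluator below (a toy word-overlap relevance score) is purely illustrative and not from the paper; a real f could just as easily be a human annotator or a learned metric.

```python
# A toy instance of y = f(h, a, S). The function name and scoring logic are
# illustrative assumptions, not part of the paper.
from typing import Callable, Optional

# f maps (h, a, S) -> y: text, aspect, optional context -> metric value.
Evaluator = Callable[[str, str, Optional[str]], float]

def word_overlap_evaluator(h: str, a: str, S: Optional[str] = None) -> float:
    """Crude automated f: scores h by word overlap with the context S.
    A real evaluator would also condition on the aspect a (e.g. relevance)."""
    if S is None:
        return 0.0
    hyp_words, ctx_words = set(h.lower().split()), set(S.lower().split())
    return len(hyp_words & ctx_words) / max(len(hyp_words), 1)

# y = f(h, a, S)
y = word_overlap_evaluator(
    h="The cat sat on the mat.",
    a="relevance",
    S="A cat was sitting on a mat in the hallway.",
)
```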
Their Approach
Summary

Figure 1. Different aspects and the NLP task they correspond to (plus a definition of what that aspect is aiming for).

Figure 2. Different aspects grouped by NLP task and their prompt templates.
Given:
- $h = \{h_1, h_2, \dots, h_m\}$, where $m$ is the number of tokens of the text you want to evaluate; so $h$ is some encoded text
- a = aspect (refer to Figure 1)
- many different aspects like factuality, consistency, etc
- aspects are within the scope of an NLP task
- definitions/questions of these aspects are also provided (these are very akin to the task descriptions)
- d = task description (refer to Figure 2)
- Each task and aspect pair has a unique description like "Generate a <aspect> summary" or "Rewrite the following ..."
- The full lines of instructions you see in Figure 2 are the prompt templates T
- Notice how these prompt templates use very similar wording within each NLP task, with only a short description substituted in for each aspect (see the sketch after this list)
- the skeleton wording of the prompt template for an NLP task is the task description
- e.g. "Generate a summary <aspect definition> for the following text:"
- the short description that gets substituted in is the aspect definition
- S = context information (think just another huge array of encoded words/paragraphs)
- $\theta$ = the weights of the evaluator LLM
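Here's a rough sketch of how a prompt template might be assembled for the summarization task. The template string and helper function below are my own illustration; the exact wording the paper uses is in Figure 2.

```python
# Illustrative prompt-template construction: the task description is the skeleton,
# and the aspect definition is substituted in. Wording is assumed, not copied
# verbatim from Figure 2.
SUMMARIZATION_TEMPLATE = (
    "Generate a summary {aspect_definition} for the following text: {source}\n"
    "Tl;dr: {hypothesis}"
)

def build_prompt(aspect_definition: str, source: str, hypothesis: str) -> str:
    """T(d, a, S): fill the task-description skeleton with aspect and context."""
    return SUMMARIZATION_TEMPLATE.format(
        aspect_definition=aspect_definition,
        source=source,
        hypothesis=hypothesis,
    )

prompt = build_prompt(
    aspect_definition="with consistent facts",  # e.g. a 'consistency'-style aspect
    source="Long source document ...",
    hypothesis="Candidate summary to evaluate ...",
)
```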
Calculate:

$$\text{GPTScore}(h \mid d, a, S) = \sum_{t=1}^{m} w_t \log p(h_t \mid h_{<t}, T(d, a, S), \theta)$$

Where,
- $w_t$ = the weight associated with the $t$-th token $h_t$ (they keep $w_t$ equal across all tokens)
- T = a prompt template (refer to Figure 2)
The formula: calculate the sum of weighted log probabilities of each token $h_t$, conditioned on the previous tokens $h_{<t}$, the prompt template $T(d, a, S)$ (which carries the task description, aspect, and context), and the model weights $\theta$.
What is it doing?
- The evaluator LLM is given a filled-in prompt template (the context info S that serves as the input, an aspect, and a task description); this basically means the model knows what aspect to look for (for the correct task) and has some context to evaluate against
- The model then predicts the log-likelihood of the next token given the previous tokens and the prompt template T.
- These are weighted and summed to output 1 number: the GPTScore.
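Below is a minimal sketch of that computation, assuming a HuggingFace-style causal LM (GPT-2 here purely for illustration) and equal per-token weights. The helper name gpt_score and the 1/m weighting are my own choices, not taken from the paper's code.

```python
# Sketch of GPTScore: sum of weighted log p(h_t | h_<t, T(d, a, S), theta).
# Model choice (gpt2) and the 1/m weighting are assumptions for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def gpt_score(prompt: str, hypothesis: str) -> float:
    """Score the hypothesis tokens conditioned on the prompt template."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    hyp_ids = tokenizer(hypothesis, return_tensors="pt",
                        add_special_tokens=False).input_ids
    input_ids = torch.cat([prompt_ids, hyp_ids], dim=1)

    with torch.no_grad():
        logits = model(input_ids).logits  # (1, seq_len, vocab_size)

    # Log-probability of each token given everything before it.
    log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
    targets = input_ids[:, 1:]
    token_log_probs = log_probs.gather(2, targets.unsqueeze(-1)).squeeze(-1)

    # Keep only the positions that predict hypothesis tokens h_1 ... h_m.
    m = hyp_ids.shape[1]
    hyp_log_probs = token_log_probs[:, -m:]

    # Equal weights across tokens (here w_t = 1/m).
    return float(hyp_log_probs.sum() / m)

score = gpt_score(
    prompt="Generate a summary with consistent facts for the following text: "
           "Long source document ...\nTl;dr:",
    hypothesis=" Candidate summary to evaluate ...",
)
```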
What is this model?
It's any of the generative LMs. The significance here is that they don't train a model specifically for evaluation: their approach is lightweight, training-free, and customizable. So do these pre-trained models come out of the box ready to evaluate other models? Not exactly.

Given a set of aspects and a task description (and possibly context information, though that's unclear from the diagram), compile them into a prompt template. Reformat this template slightly to include demonstrations, i.e. examples of good outputs. Add the test/evaluation sample to the end and leverage in-context learning.
Finally, feed these inputs to the model so it can generate probabilities for the evaluation sample. The GPTScore is then computed with these probabilities in the formula above!
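Here's a rough sketch of that few-shot setup, reusing the gpt_score helper from the sketch above. The demonstration texts and template wording are placeholders I made up; the paper's actual demonstrations come from its evaluation datasets.

```python
# Illustrative few-shot prompt: demonstration (source, good summary) pairs are
# prepended, then the evaluation sample is appended so the model can leverage
# in-context learning. All strings below are placeholders.
demonstrations = [
    ("Source document A ...", "A good reference summary of A ..."),
    ("Source document B ...", "A good reference summary of B ..."),
]

def build_fewshot_prompt(aspect_definition: str, source: str) -> str:
    """Demonstrations followed by the instruction for the test sample."""
    blocks = [
        f"Generate a summary {aspect_definition} for the following text: {src}\n"
        f"Tl;dr: {summary}"
        for src, summary in demonstrations
    ]
    blocks.append(
        f"Generate a summary {aspect_definition} for the following text: {source}\n"
        "Tl;dr:"
    )
    return "\n\n".join(blocks)

# The GPTScore of the evaluation sample, conditioned on the demonstrations.
fewshot_prompt = build_fewshot_prompt("with consistent facts", "Long test document ...")
score = gpt_score(fewshot_prompt, " Candidate summary to evaluate ...")
```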
For information on their results, check out the paper! They have lots of tables!
References

- Fu, J., Ng, S.-K., Jiang, Z., & Liu, P. (2023). GPTScore: Evaluate as You Desire. arXiv:2302.04166.