
LLM evaluation metrics: A comprehensive guide for large language models

Learn how to evaluate Large Language Models (LLMs) effectively. This guide covers automatic & human-aligned metrics (BLEU, ROUGE, factuality, toxicity), RAG, code generation, and W&B Guardrail examples.
Large Language Models (LLMs) have revolutionized AI with their ability to generate human-like text, but evaluating their performance remains a critical challenge. As LLMs are deployed in real-world applications – from chatbots and content generators to code assistants – how do we measure their quality, correctness, and safety?
In this report, we’ll explore the key evaluation metrics for LLMs, covering both automated quantitative metrics and human-aligned criteria. We’ll discuss popular metrics like perplexity, BLEU, ROUGE, BERTScore, MAUVE, and accuracy, as well as qualitative metrics such as faithfulness, factuality, toxicity, helpfulness, and harmlessness. We’ll also examine use-case-specific metrics (for RAG, summarization, chatbots, code generation), compare the strengths and weaknesses of various approaches, and highlight best practices for aligning metrics with human judgment.
By the end, you’ll have a clear understanding of how to evaluate LLM outputs in a rigorous yet practical way – ensuring your model is not only performant but also aligned with user needs and ethical guardrails. Let’s dive in!

Why do we need LLM evaluation metrics?

Evaluating LLMs is essential to ensure models are reliable, accurate, and safe before deploying them in any application. Unlike traditional software, LLMs can produce unpredictable outputs, so we need metrics to systematically quantify their performance.
Robust evaluation helps:
  • Compare models and prompts: Determine which model (or model version) performs better on a given task.
  • Track improvements: See if fine-tuning or prompt tweaks actually improved results over time.
  • Ensure quality and safety: Catch issues like factual errors, incoherence, or toxic content before they reach end-users.
  • Meet requirements: Verify the model meets business or user-defined criteria (e.g. a support chatbot must solve customer queries correctly and respectfully).
However, evaluating LLMs is harder than evaluating simpler ML models. Traditional metrics (like accuracy for classification) often fall short for generative tasks – there may be multiple valid outputs, and aspects like fluency or helpfulness are subjective. Human evaluation is the gold standard but is costly, slow, and sometimes inconsistent. This has led to a growing toolkit of automatic metrics and frameworks designed for LLMs, each with its own pros and cons.
In the sections below, we’ll break down the landscape of LLM evaluation metrics into categories and specific examples, then look at how to choose the right ones for your use case.

Categories of LLM evaluation metrics

When evaluating LLMs, it’s helpful to consider two broad categories of metrics:
  • Automatic Metrics: Quantitative scores computed by algorithms or models. These require no human in the loop and often compare the LLM output against a reference or use an intrinsic measure. Examples: perplexity, BLEU/ROUGE scores, BERTScore, MAUVE, exact match accuracy.
  • Human-Aligned Metrics: Qualitative judgments reflecting human preferences or values. These include clarity, coherence, helpfulness, harmlessness, etc., and are often obtained via human raters or learned proxies (like another LLM “judge”). Examples: toxicity level, factuality/faithfulness, helpfulness ratings.
Another useful distinction is by task type:
  • Generative task metrics: For open-ended text generation (e.g. stories, dialogs) or tasks like summarization and translation. These often use reference-based metrics (compare output to a reference text) or distribution-based metrics (assess output quality by comparing to human language distribution).
  • Classification/decision task metrics: For tasks where the LLM produces a specific answer or class (e.g. multiple-choice Q&A, code correctness, tool usage). These can use metrics like accuracy, precision/recall, F1, or domain-specific success rates (e.g. test pass rates for code).
To summarize, Table 1 categorizes some common metrics by whether they are automatic vs. human-aligned and the kind of task they are suited for:
Table 1: Common LLM evaluation metrics, categorized by type and use case.
| Metric | Type | Use Case |
| --- | --- | --- |
| Perplexity | Automatic (intrinsic) | Language modeling (generative quality). Lower is better (the model is less “surprised” by the text). |
| BLEU | Automatic (reference) | Translation, summarization (generative). Measures n-gram overlap precision. |
| ROUGE | Automatic (reference) | Summarization (generative). Measures n-gram overlap recall. |
| BERTScore | Automatic (reference) | Any text generation with a reference. Uses embedding similarity for semantic overlap. |
| MAUVE | Automatic (distribution) | Open-ended generation. Compares model vs. human text distribution quality. |
| Exact Match / Accuracy | Automatic (outcome) | Q&A, classification, code tests. Checks if the output exactly matches the correct answer or passes tests. |
| Toxicity | Automatic (safety) | Any generative output. Classifier score for hateful/offensive content (lower is better). |
| Bias/Fairness | Automatic (safety) | Any output. Flags biased or derogatory content (e.g. sexist or racist remarks). |
| Factuality / Faithfulness | Automatic or human-aligned (quality) | Factual tasks, summarization, RAG. Measures correctness of facts and absence of hallucination. Can use references or judgment. |
| Coherence & Fluency | Human-aligned (quality) | Any text. Judges whether the output is well-structured and grammatically fluent. Often rated by humans or a learned model. |
| Helpfulness | Human-aligned (preference) | Chatbots, assistants. Human or LLM-judge rating of how well the response addresses user needs. |
| Harmlessness | Human-aligned (preference) | Chatbots, any AI. Rating of whether the content avoids harm (no toxicity, bias, violence, etc.) and aligns with ethical guidelines. |

Automatic metrics are computed algorithmically, while human-aligned metrics reflect human judgment (often via surveys or AI proxies). Some metrics can be partially automated (e.g. factuality via a database check) but may still require human verification for nuanced cases.
In practice, both categories are important. Automatic metrics provide objective, repeatable scores and are great for tuning models during development. Human-aligned metrics ensure the model’s output is meeting the qualitative expectations of end users and society (correct, understandable, non-harmful).
Many evaluation frameworks, including W&B Weave, combine these – logging automatic scores for each LLM output and also incorporating human feedback loops when needed. W&B Weave uses "Scorers" that can act as Guardrails during evaluation, such as automated checks for Toxicity, PII (Personally Identifiable Information), Context Relevance, and Hallucination.
Next, let’s delve deeper into specific metrics in each category, how they work, and their pros/cons.

Automatic evaluation metrics for LLMs

Automatic metrics let us quantitatively evaluate LLM outputs without human intervention. They range from classical intrinsic metrics (like perplexity) to reference-based overlap metrics (like BLEU/ROUGE), embedding-based metrics (like BERTScore), and more specialized measures.
Below we explain each major type:

Perplexity – how well does the model predict text?

Perplexity (PPL) is a fundamental metric for language models that gauges how “surprised” the model is by a given text. Formally, it is defined as the exponentiated average negative log-likelihood of a sequence under the model. Intuitively, a lower perplexity means the model assigns higher probability to the test text, indicating better predictive power (and often better fluency). For example, if a model has a perplexity of 20 on a dataset, it’s less confident (more surprised) than a model with perplexity 10.
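Written out, for a tokenized sequence $x_1, \dots, x_N$ scored by a model with token probabilities $p_\theta$, this definition is:
$$\text{PPL}(x_{1:N}) = \exp\Big(-\tfrac{1}{N}\sum_{i=1}^{N} \log p_\theta(x_i \mid x_{<i})\Big)$$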
Perplexity is often used to evaluate base LLMs on held-out data without requiring any reference output – the model’s own probability distribution serves as the yardstick. It’s especially relevant when pre-training or fine-tuning language models (lower perplexity on validation data usually correlates with a better model fit). However, perplexity alone doesn’t tell the whole story for user-facing quality – a model can have low perplexity (predicting text like the dataset) but still produce irrelevant or unsafe answers in conversation. It’s best used to compare language modeling ability of models or track training progress.
How to calculate perplexity: Modern libraries can compute perplexity easily. For a causal LLM, you can take the cross-entropy loss on a test corpus and exponentiate it. For example, using Hugging Face Transformers:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
text = "Hello, how are you?"
enc = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    # Compute the average negative log-likelihood (cross-entropy) loss
    loss = model(**enc, labels=enc["input_ids"]).loss
perplexity = torch.exp(loss)
print(f"Perplexity: {perplexity.item():.2f}")

This snippet loads a GPT-2 model and calculates its perplexity on the sample text. A lower perplexity (closer to 1) would indicate the model finds the text very predictable, whereas a high value means the text was surprising to the model. In practice, large well-trained LLMs achieve low perplexities on typical language data, but any specific prompt’s perplexity can vary with how well it matches the model’s training distribution.
When to use: Use perplexity to evaluate intrinsic language modeling quality. It’s great for comparing language models or gauging if a model has been fine-tuned effectively (e.g., if fine-tuning on domain data lowers perplexity on domain test set). It’s not task-specific and doesn’t require references or human labels. But note: perplexity is only defined for probabilistic models that can compute likelihoods (usually autoregressive LMs). It doesn’t apply to models like BERT (masked LMs), and it won’t directly tell you if a response is correct or helpful – just how fluent/predictable the text is.

N-gram overlap metrics (BLEU, ROUGE, etc.) – how much does the output match a reference text?

For many language generation tasks, especially those with a fairly defined target output (like translation or summarization), overlap-based metrics are standard. These metrics compare the model’s output to one or more reference texts (assumed to be high-quality human-written answers) by measuring overlapping subsequences. The most common are:
  • BLEU (Bilingual Evaluation Understudy): Originally developed for machine translation, BLEU evaluates how many n-grams in the generated text appear in the reference text. It computes a precision score for n=1 up to 4 (typically) and combines them (with a brevity penalty to discourage outputs that are too short). A higher BLEU means closer word overlap with the reference. BLEU works best when there’s a relatively fixed phrasing expected (e.g. translations). However, it can be overly strict – e.g., using a synonym or paraphrase will lower BLEU even if the meaning is preserved.
  • ROUGE (Recall-Oriented Understudy for Gisting Evaluation): Commonly used for summarization tasks, ROUGE measures recall of n-grams – essentially, what percentage of the reference’s n-grams appear in the output. ROUGE-N is for n-gram overlap, and ROUGE-L measures longest common subsequence overlap. High ROUGE means the model’s summary covered a lot of the reference summary’s content. Like BLEU, it’s superficial – it doesn’t account for meaning, just exact overlaps. Still, ROUGE is a decent proxy for content coverage in summarization and is easy to calculate.
  • METEOR: A metric that goes beyond exact n-grams by including stemming and synonym matching. METEOR calculates precision and recall of unigram matches, but unlike BLEU, it tries to align words and can use a thesaurus (like WordNet) to count synonyms as matches. It often correlates better with human judgments than BLEU on some tasks, thanks to considering synonyms and partial credit for matches. It outputs a score (0-1) like BLEU/ROUGE.
  • Exact Match & F1 (for QA): In tasks like extractive QA or closed-book QA, a simple but effective metric is Exact Match – does the model output exactly the correct answer string – and F1 score – which accounts for partial overlaps (treating the answer and output as bag-of-words sets). These are essentially overlap metrics at the answer level. For example, if the question’s true answer is “George Washington” and the model says “Washington”, that’s a partial match (F1 < 1.0, Exact Match = 0). These metrics are popular in question-answering benchmarks.
Pros: N-gram overlap metrics are straightforward and language-agnostic. They have been the backbone of evaluation in MT and summarization for years. They don’t require training a separate model – just string comparisons. A high BLEU/ROUGE often roughly correlates with the output being relevant and on-topic.
Cons: These metrics don’t capture meaning – they only reward literal overlap. As a result, they can mislead: a perfectly valid rephrasing can get low scores, and conversely, a fluent but content-empty output might cheat a metric by copying lots of words from the reference. Indeed, studies have shown poor correlation of BLEU/ROUGE with human judgments for many generative tasks, especially open-ended ones. They also require reference outputs, which may not exist for every task or could themselves be imperfect.
When to use: Use BLEU for translation or tasks where exact wording matters. Use ROUGE for summarization as a quick check of content coverage. Always remember these are shallow metrics – useful for regression tests (did my tweak drop BLEU?) or optimizing in early training, but not sufficient alone to guarantee quality. If using these, consider pairing with a metric that checks semantics or factuality.
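As a quick illustration, the Hugging Face evaluate library (one of several implementations of these metrics) can compute BLEU and ROUGE in a few lines; the exact keys in the returned dictionaries may vary slightly between library versions:
!pip install evaluate rouge_score

import evaluate

predictions = ["The cat sits on the mat"]
references = [["The cat is sitting on the mat"]]  # BLEU allows multiple references per prediction

bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")

bleu_result = bleu.compute(predictions=predictions, references=references)
rouge_result = rouge.compute(predictions=predictions, references=[r[0] for r in references])

# BLEU may be 0.0 here (no shared 4-grams), which illustrates how strict it is
print("BLEU:", round(bleu_result["bleu"], 3))
print("ROUGE-L:", round(rouge_result["rougeL"], 3))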

Embedding-based semantic similarity (BERTScore & beyond) – do the output and reference mean the same thing?

In response to the limitations of n-gram metrics, newer metrics leverage pre-trained language model embeddings to measure similarity in a semantic sense. The prime example is BERTScore:
  • BERTScore: This metric computes similarity by embedding both the candidate output and the reference text using a model like BERT (or RoBERTa, etc.) and then matching tokens in one to tokens in the other. Instead of exact word matches, it uses cosine similarity between embedding vectors of words. For each word in the candidate, BERTScore finds the most similar word in the reference (and vice versa) to compute a precision and recall, then a combined F1 score. The idea is that if the output conveys the same meaning as the reference, the embeddings (which capture context and semantics) will be similar even if the exact words differ. For example, “The cat sits on the mat” vs “A feline rests upon a rug” might have low BLEU, but a high BERTScore because “cat” is close to “feline” in embedding space, “mat” to “rug”, etc. A perfect semantic match would score 1.0.
BERTScore has been shown to correlate better with human judgment than BLEU on many tasks, as it can catch synonymy and differences in phrasing. It’s especially helpful for longer, more free-form texts where exact overlap is too strict. Variants like BLEURT (which uses a fine-tuned BERT on human rating data) go further in this direction. Other embedding metrics include MoverScore (uses Earth Mover’s Distance on embeddings) and metrics that use newer models (e.g. embedding from GPT-3). These all try to capture the semantic overlap or adequacy of the generated text relative to a reference or source.
Pros: Embedding-based metrics are more forgiving to wording differences and can reward semantic similarity. They often achieve higher correlation with human evaluations of quality and relevance. They can leverage powerful language models’ understanding, effectively doing a shallow form of “AI judging AI” based on meaning.
Cons: They rely on the quality of the underlying embedding model – if it has biases or blind spots, the metric inherits them. For instance, BERTScore might not work well for domains far from what BERT was trained on, or languages BERT isn’t good at. Also, these metrics are less interpretable – a number comes out, but it’s not obvious which differences contributed. And like overlap metrics, they generally require reference outputs for comparison (except in special setups where you compare output to input for consistency, etc.).
When to use: BERTScore (or similar) is great for evaluating summaries, translations, or paraphrases when you care about semantic equivalence and have reference texts. It’s often used in research papers to complement BLEU/ROUGE. If your application can tolerate the extra compute, embedding metrics are a good default for generative tasks with references, as they better capture meaning. They are also useful in Retrieval-Augmented Generation (RAG) to compare model output with retrieved passages (to check if the model used the info correctly). You can compute BERTScore easily with existing libraries. For example:
!pip install bert_score # install the bert-score library

from bert_score import score

candidates = ["The cat sits on the mat"]
references = ["A feline rests upon a rug"]
P, R, F1 = score(candidates, references, lang="en", verbose=True)  # per-example precision, recall, F1 tensors
print(f"BERTScore F1: {F1[0].item():.4f}")

This would output a high BERTScore F1 (close to 1) for these semantically similar sentences, even though they differ lexically. In contrast, BLEU would be near zero for this example. This illustrates how BERTScore focuses on meaning over exact wording.

Generative quality metrics (MAUVE and others) – how human-like is the output distribution?

Not all generative tasks have a single reference text to compare against. For open-ended generation (chatbot conversations, story generation, etc.), we often care about overall quality and diversity of outputs. Enter distribution-based metrics like MAUVE:
  • MAUVE: MAUVE is a metric designed to measure the gap between the distribution of machine-generated text and human text. It effectively compares two sets of text samples – one from the model, one from human-written data – and computes a divergence score (it clusters the samples in a shared embedding space and computes KL divergences between the resulting distributions). The MAUVE score (ranging 0 to 1) indicates how well the model’s distribution approximates human-like language: higher MAUVE = more human-like generation. Unlike BLEU, which needs one reference per output, MAUVE looks at overall distributions. It’s particularly useful for open-ended tasks like long-form text generation, dialogues, or anywhere you can sample many outputs and want to evaluate them collectively.
Research has found that MAUVE correlates strongly with human evaluation of open-ended text quality, more so than traditional metrics. Essentially, if your model outputs have a good mix of richness and coherence similar to real text, MAUVE will be high.
Other distribution metrics: Self-BLEU is a simpler one sometimes used to gauge diversity – it computes BLEU of each generated output against other generated outputs (low self-BLEU means outputs are diverse). Perplexity can be used in reverse: evaluate a known good model’s perplexity on the generated outputs – if the good model finds them likely, they might be high-quality. These are more ad-hoc though. MAUVE is a more principled approach.
Pros: Distribution metrics don’t require a ground-truth for each prompt; they assess the quality of the model’s language as a whole. They can capture things like a model mode-collapsing (producing very similar outputs for everything) or being noticeably dissimilar from human style.
Cons: They are less interpretable and granular. A MAUVE score tells you generally about the model, but not about a specific output. Also, computing such metrics requires a sample of outputs and possibly a separate language model to embed them (in MAUVE’s case, it uses an LLM’s embedding space). They might not reflect task success if the task isn’t just “sound human-like” (e.g., a model could be on-topic or factual or not – MAUVE might not directly account for factual correctness, just fluent realism).
When to use: Consider MAUVE when evaluating creative or open-ended generation, like story generators or conversational models, where you can sample a set of outputs and want an overall quality measure. It’s also useful when comparing two large models or decoding methods: which produces text more similar to human writing? For instance, MAUVE can highlight differences between sampling strategies (nucleus vs. beam search) by checking which yields more human-like distribution. Use it as a complement to task-specific metrics: MAUVE for general quality plus something for factuality or relevance if needed.
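For reference, the authors’ mauve-text package computes the score from two lists of text samples. The sketch below shows the basic call (the sample lists are placeholders – MAUVE needs a few hundred samples per side to be meaningful, and arguments like the featurization model or device can be configured differently):
!pip install mauve-text

import mauve

human_texts = [...]   # placeholder: a few hundred human-written samples
model_texts = [...]   # placeholder: generations from your model on comparable prompts

out = mauve.compute_mauve(
    p_text=human_texts,
    q_text=model_texts,
    device_id=0,       # GPU id for the embedding model (see the package docs for CPU use)
    verbose=False,
)
print("MAUVE:", out.mauve)  # closer to 1.0 = more human-like output distribution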

Task-specific automatic metrics (accuracy, etc.) – did the model get the correct answer or complete the task?

Some LLM applications boil down to getting a correct answer or executing a task, where we can define an objective success criterion. In such cases, we can use classic metrics:
  • Accuracy / Error Rate: If the LLM is used for classification or to pick one of a few options (e.g. multiple-choice questions, next action in an agent, picking a tool), we can measure accuracy (% of outputs that are exactly correct). For example, if using an LLM to take a standardized test (like MMLU or another benchmark), the accuracy on those questions is a clear metric of performance.
  • Precision/Recall/F1: If the task involves identifying items or producing lists (like extracting names from text, or multi-label classification), precision and recall can be computed comparing the set of items output vs the ground truth set.
  • Exact Match (EM): Often used in QA evaluations (e.g., SQuAD dataset), it’s a strict measure: does the model’s answer exactly match the ground truth answer string (after normalization)? It’s a harsher criterion than accuracy in multiple-choice because any deviation or wording difference counts as 0.
  • Code correctness metrics: For code generation tasks, automatic evaluation often uses execution-based metrics. A common one is pass@k – the percentage of problems for which at least one of the model’s k generated solutions passes all unit tests (i.e., is a correct program). For instance, Codex might be evaluated on pass@1 or pass@10 on a suite of coding challenges. This is effectively an accuracy measure but allowing multiple tries. Similarly, if the code can be executed, you might measure runtime errors vs. successful runs.
  • Tool use success: In agent scenarios, if an LLM is supposed to call tools (APIs, calculators), you can define metrics like Tool Success Rate (did the model call the correct tool and get the right result?). The Confident AI guide, for example, defines Tool Correctness as whether the agent used tools appropriately. This often reduces to an accuracy measure on a set of scenarios.
These task-oriented metrics are very binary in nature – success or failure. They are immensely useful wherever applicable because they directly measure the end-goal (e.g., did the user get the right answer or a working solution?).
Pros: When you can use a definitive metric like accuracy or pass@k, it provides clear, actionable feedback. There’s no ambiguity: an answer is either correct or not. These metrics align well with user satisfaction in domains where correctness is paramount (e.g., math problems, factual QA, code solutions).
Cons: Not every task has a ground-truth answer. Many generative tasks don’t have a single “correct” output, so accuracy would be meaningless. Also, strict accuracy doesn’t capture nuance (an answer might be partially correct but formatted differently – accuracy would mark it wrong unless you add leeway). For code generation, execution metrics can be slow to compute (you have to run code, which might be unsafe without sandboxing).
When to use: Always use objective metrics like accuracy or EM if your task permits it. For example, if you’re building a medical QA system with a set of gold-standard answers, track accuracy or F1 to know if the model is giving correct info. If you have a chatbot that should classify intents or route to the correct answer from a knowledge base, measure how often it chooses the right knowledge entry. Essentially, for any evaluation dataset where you can label outputs as correct/incorrect, use classification metrics – they are straightforward and usually well-understood by stakeholders.
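To make these concrete, here is a small, dependency-free sketch of exact match and token-level F1 in the style of SQuAD evaluation (the answer normalization shown is simplified):
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    # Lowercase, drop punctuation and articles, collapse whitespace (simplified SQuAD-style)
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, reference: str) -> float:
    return float(normalize(prediction) == normalize(reference))

def token_f1(prediction: str, reference: str) -> float:
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Washington", "George Washington"))        # 0.0
print(round(token_f1("Washington", "George Washington"), 2)) # 0.67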

Human-aligned metrics: quality, safety, and beyond

While automatic metrics are invaluable, they often miss the human element – is the response actually good from a user’s perspective? Does it follow instructions? Is it free from harmful content? Human-aligned metrics are those that capture these aspects, either via direct human evaluation or via models trained to mimic human judgments. Here we cover key qualitative metrics and how they’re evaluated:

Coherence, fluency, and relevance – does the output make sense and flow well?

These are fundamental qualities for any generated text:
  • Coherence: A coherent response is logically consistent and well-structured. It should stay on topic and not contradict itself. Coherence can be assessed by humans by reading the output for logical flow. Automatic proxies include checking if the output stays relevant to the prompt (no random tangents) and whether it follows a sensible narrative or reasoning. W&B Weave, for instance, includes a ContextRelevance scorer which can function as a guardrail to ensure the output is on-topic with the input or provided context. For multi-turn chats, coherence also means maintaining context from prior turns.
  • Fluency: This refers to the grammar, syntax, and style – basically, does the text read like good, natural writing in the target language? A fluent response has no obvious grammatical errors or awkward phrasing and follows normal usage. Fluency is often high for modern LLMs (they’re quite good at local language modeling), but it can degrade if the model is stressed or if it’s producing something it’s less sure about. There are automated grammar checkers (like LanguageTool or GPT-based checkers) that can rate fluency, but human reading is the gold standard.
  • Relevance: Especially for tasks like Q&A or dialogue, we ask: Did the model actually address the user’s query or the task prompt? A response can be perfectly fluent and coherent yet completely irrelevant or evasive. Ensuring relevance means the model’s answer stays on the topic asked and provides information that’s useful for the prompt. In RAG systems, this ties to using provided context (is the answer grounded in the retrieved documents?). Metrics like answer relevancy, such as W&B's ContextRelevance scorer, are often implemented via LLM-as-a-judge or heuristics to see if the question’s keywords or intent appear in the answer.
These qualities are usually measured via human evaluation forms where raters score outputs on a scale (e.g., 1-5 for coherence, etc.). Recently, an alternative is to use an LLM as a judge – e.g., prompt GPT-4 with the conversation and ask it to score coherence or give a ranking. Such LLM-based evaluators have shown promising correlation with human ratings, though care is needed to avoid biases.
Why they matter: If an LLM output is incoherent or irrelevant, it’s basically failing even if some metrics like BLEU don’t catch it. These are basic quality filters. In production systems, one might implement guardrails to catch incoherent answers (for example, detect if the answer is gibberish or doesn’t contain any keywords from the question, and then re-try or default to a safe response).
W&B Guardrails example: Weave Guardrails provides out-of-the-box quality scorers, such as ContextRelevance, that can flag an output if it’s off-topic or poorly formed, allowing developers to then handle those cases (maybe by re-prompting or returning an error message). This helps maintain a baseline of answer quality in deployed LLM applications.
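As an illustration of an automated relevance check (not the actual implementation behind Weave’s ContextRelevance scorer), you can embed the prompt and the response and compare them with cosine similarity, e.g. using the sentence-transformers library; the model name and the 0.4 threshold below are illustrative choices:
!pip install sentence-transformers

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # small general-purpose embedding model

question = "How do I reset my account password?"
answers = [
    "You can reset it from the account settings page via the 'Forgot password' link.",
    "Our company was founded in 1999 and has offices worldwide.",
]

q_emb = model.encode(question, convert_to_tensor=True)
for answer in answers:
    a_emb = model.encode(answer, convert_to_tensor=True)
    sim = util.cos_sim(q_emb, a_emb).item()
    verdict = "on-topic" if sim > 0.4 else "possibly off-topic"
    print(f"similarity={sim:.2f} -> {verdict}")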

Factuality and faithfulness – is the content correct and grounded in truth?

One of the biggest issues with LLMs is that they can “hallucinate” – i.e., produce plausible-sounding but incorrect information. We define two related metrics:
  • Factuality: Measures whether the statements the model makes are factually correct in the real world. For instance, if asked “Who is the President of France in 2023?” a factually correct answer is “Emmanuel Macron.” An answer like “Jean Dupont” would be factually incorrect. Factuality is hard to assess automatically because it requires a knowledge source or ground truth. Techniques include checking the output against a knowledge base or Wikipedia (for factual QA), or using a separate fact-checking model. Datasets like TruthfulQA also attempt to quantify how often a model gives true vs. false answers for tricky questions.
  • Faithfulness: Often discussed in context of summarization or RAG, faithfulness means the model’s output does not introduce unsupported information and sticks to the provided source content. For example, a summary is faithful if everything in it is derived from the original document (no made-up facts), even if the original document might not be universally true. In RAG, an answer is faithful if it is grounded in the retrieved documents and doesn’t inject outside hallucinated info. Essentially, it’s a measure of not hallucinating relative to source. An unfaithful output would be one that contains details not present in the input source.
Both factuality and faithfulness deal with hallucination: factuality in a broad sense (truth to the world), faithfulness in a narrow sense (truth to given context). Various metrics and tools exist to assess these:
  • String match checks: In RAG, a simple metric is Precision@k of source citation – e.g., check if the answer contains sentences or facts that can be found in the top-k retrieved documents. If not, likely a hallucination. Some use an overlap score between answer and sources to gauge groundedness.
  • Question generation + QA (Q^2): One clever metric for faithfulness (for summaries) is to generate questions from the summary and see if the original text can answer them. If the original text can’t provide the answer given in the summary, the summary probably hallucinated that detail. This approach, called Q^2, yields a score of consistency.
  • Fact-check models: There are classifiers fine-tuned to detect factual inconsistencies. For instance, FactCC was a model for checking if a summary is consistent with an article. Similarly, one can prompt GPT-4: “Is the following statement correct? ...” to use it as a fact-checker (with caution). W&B Weave also offers scorers that can act as guardrails for Hallucination and Faithfulness, providing automated checks against provided context or known facts.
  • Human eval: Ultimately, human experts sometimes need to verify facts. This is common in domains like medical or legal, where every statement might need verification.
The Confident AI metrics list includes Correctness (whether output is factually correct based on ground truth) and Hallucination (whether output includes made-up info) as core metrics. These align exactly with factuality and faithfulness.
Pros: Ensuring factuality/faithfulness is crucial for user trust. A model that is coherent but confidently wrong can be dangerous. These metrics directly target that failure mode. Automated checks can catch obvious errors (names, dates, etc.).
Cons: It’s extremely challenging to automate. The model might say something that is unverifiably true or partially true. Human judgment is often needed for subtle cases. Also, for creative tasks, “factuality” might be less relevant (a fictional story isn’t fact-checked, for instance). There’s also the issue of the evolving truth (if an LLM was trained a year ago, it might state an outdated fact – is that a hallucination or just training cutoff? Typically it’s considered an error relative to current truth).
When to use: Always consider a factuality metric if your LLM is supposed to provide information or summarize content. For example, in a customer support bot summarizing account info, faithfulness to that info is key. In a medical Q&A, factual accuracy can be life-critical. Use available tools: retrieval augmentation, guardrails like “don’t answer if not sure”, and metrics (including automated ones like W&B's Hallucination scorer) to evaluate on a test set how often the model’s statements are true. Even though full automation is hard, tracking a proxy metric for factuality (like a rate of hallucination on a known set of questions) is very valuable.
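A crude but useful proxy for groundedness is the fraction of answer sentences whose content words mostly appear in the source text. Real hallucination checks (NLI models, LLM judges, or Weave’s Hallucination scorer) are more sophisticated, but this sketch shows the basic idea:
def grounded_fraction(answer: str, source: str, min_overlap: float = 0.5) -> float:
    # Fraction of answer sentences whose longer words mostly appear in the source
    source_words = set(source.lower().split())
    sentences = [s.strip() for s in answer.split(".") if s.strip()]
    grounded = 0
    for sent in sentences:
        words = [w for w in sent.lower().split() if len(w) > 3]  # skip very short words
        if not words:
            continue
        overlap = sum(w in source_words for w in words) / len(words)
        if overlap >= min_overlap:
            grounded += 1
    return grounded / max(len(sentences), 1)

source = "Emmanuel Macron was re-elected President of France in 2022."
answer = "Emmanuel Macron is the President of France. He owns three private islands."
print(grounded_fraction(answer, source))  # 0.5 – the second sentence looks unsupported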

Toxicity and harmlessness – does the model avoid harmful or offensive content?

LLMs can sometimes produce inappropriate or harmful text, intentionally or not. Toxicity generally refers to content that is hate speech, harassment, profanity, or otherwise offensive/discriminatory. Harmlessness is a broader term used in alignment literature, meaning the model avoids causing harm – this includes toxicity but also things like not giving instructions for dangerous activities or not encouraging self-harm, etc. Essentially, a harmless model respects ethical and safety guidelines.
How to measure toxicity: The common approach is to use a toxicity classifier (like OpenAI’s moderation endpoint or Google’s Perspective API) to assign a toxicity score to the output. For instance, Perspective API returns scores for various categories (toxicity, insult, threat, etc.). W&B Weave Guardrails include a Toxicity scorer that flags if a response contains hateful/toxic content, often used as an evaluation metric or a real-time block. Another important guardrail often used in evaluation is PII (Personally Identifiable Information) detection, ensuring the model doesn't leak sensitive data.
Bias and offensive content: Closely related, we often check for biased or hateful outputs targeting protected groups. A model might not use slurs (toxicity), but could still exhibit bias (e.g., making unfair assumptions about a gender or ethnicity). Specific metrics like a sexism classifier or more general bias score can be applied. For example, the model outputs can be analyzed for sentiments towards demographic-identifying phrases (there’s research metrics like regard score, etc.). Companies often have internal benchmarks: e.g., feed the model templated prompts (“As a [gender], I think ...”) and see if outputs differ or contain stereotypes.
Harmlessness evaluations: Anthropic’s HH-RLHF work defined “Harmlessness” as avoiding a wide range of harmful behaviors. To evaluate this, human labelers were asked to label if any given output was harmful or violated a policy. You can similarly create a checklist of disallowed content (violence, sexual content, extremism, etc.) and have either humans or classifiers check outputs against it. Ideally, your evaluation should also include some adversarial prompts to test the model’s guardrails.
Pros: Monitoring toxicity and similar safety metrics is essential for real-world deployment. These metrics help ensure your LLM doesn’t produce content that could harm users or a company’s reputation. Automatic detectors, like W&B's Toxicity or PII scorers, are reasonably good at flagging overt issues, enabling fast scanning of lots of outputs during evaluation.
Cons: Classification isn’t perfect – sometimes benign content gets flagged (false positives) and sometimes subtle harmful content slips by (false negatives). Also, cultural context matters: what’s considered offensive can vary. Over-reliance on automated metrics may miss nuance. And an LLM might learn to hide toxic content in more polite wording, evading detection but still conveying harm (e.g., dog whistles or veiled insults).
When to use: Always use toxicity and related safety metrics (like PII detection) when evaluating a general-purpose LLM, especially one that will interact with users. Even if your application is not “open domain” (say it’s a coding assistant), it’s wise to include some prompts that test the model’s responses to provocative or inappropriate requests and measure how it responds. Combine multiple metrics if possible (toxicity, bias, harassment) to get a broad picture of content safety. For a deployed system, aim for zero tolerance on these – if any outputs are flagged in testing, that indicates a need for better filtering or model fine-tuning.
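In a pipeline you would typically call a moderation API or a Weave Toxicity scorer; as a self-contained illustration, an open-source classifier from the Hugging Face Hub can be run through the transformers pipeline (the unitary/toxic-bert checkpoint used here is one community option, and its label set and score calibration are specific to that checkpoint):
from transformers import pipeline

tox = pipeline("text-classification", model="unitary/toxic-bert")

outputs = [
    "Thanks for your question, happy to help!",
    "You are an absolute idiot and everyone hates you.",
]
for text in outputs:
    result = tox(text)[0]  # top label and score for this checkpoint
    print(f"{result['label']} ({result['score']:.2f}) <- {text}")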

Helpfulness and user satisfaction – is the model effectively addressing user needs?

Going beyond not being harmful, we want our LLMs to be actively helpful and aligned with user intent. In instruction-following and chatbots, helpfulness is a key metric used in Reinforcement Learning from Human Feedback (RLHF). It measures how well the model’s response fulfills the user’s request, is clear and informative, and overall leaves the user satisfied. This is inherently a subjective metric – usually captured by human ratings. For example, human labelers might rank which of two responses is more helpful for a given question (as done in training InstructGPT).
How to evaluate helpfulness: If you have the resources, the best way is to conduct a user study or A/B test: show users (or hired annotators) different model responses and have them rate helpfulness on a scale or choose the best one. These ratings can then be averaged into a score. When that’s not feasible, a proxy is to use an LLM evaluator: e.g., ask GPT-4 “Rate the helpfulness of the assistant’s answer on a scale from 1 to 5” given the conversation. Interestingly, GPT-4 and similar models, when carefully prompted, can provide ratings that somewhat correlate with average human opinion – this is the idea behind using LLMs as judges in evaluations.
  • Other human-centric metrics: User engagement (does the user ask follow-up questions, or do they abandon the conversation?), readability, politeness/tone, and humor/creativity for certain applications. In customer support settings, one might measure CSAT (customer satisfaction) via surveys after an interaction with an AI agent.
Harmlessness vs. helpfulness trade-off: Sometimes making a model very safe (it never says anything possibly offensive) can reduce its helpfulness (it may refuse too often). There is a balancing act here. Anthropic frames model quality along three axes, abbreviated “HHH”: Helpful, Honest, Harmless. Honesty (truthfulness) we covered under factuality; helpfulness and harmlessness are the focus here. In evaluation, you often want to measure all three axes, because optimizing only one can hurt the others.
Pros: Human feedback-based metrics ultimately capture the end-user experience which pure automated metrics cannot. A model with a slightly lower BLEU but higher helpfulness (users prefer its answers) is the better model! These metrics directly drive model improvement via RLHF – the model is tuned to maximize them.
Cons: They require humans in the loop, which is expensive and slow. Even using an LLM as a judge, you’re incurring extra compute and it’s not guaranteed to match all facets of human preference (there could be biases – e.g., an LLM judge might overly reward verbose answers). Also, helpfulness is context-dependent: a correct but terse answer might be seen as less helpful than a more verbose friendly answer in a general chat, but in a professional QA setting, brevity might be preferred. So you have to define carefully what “helpful” means for your use case and ensure evaluators know that.
When to use: In fine-tuning or choosing between model variants for user-facing applications, always incorporate some form of human preference evaluation. This could be as simple as you yourself testing outputs, or as structured as a large-scale human annotation project. If launching a chatbot, beta test it with users and gather feedback on answers. For ongoing quality monitoring, you might not rate every output but perhaps sample some interactions and have them reviewed. If using RLHF or preference modeling, you’re by definition using these metrics to train; for evaluation, you might report the model’s win-rate vs. a baseline in side-by-side comparisons (e.g., “humans preferred our model’s answer over GPT-3’s answer 70% of the time” – a helpfulness comparison metric).
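As a rough sketch of LLM-as-a-judge scoring with the OpenAI Python client (the model name, rubric, and single-digit output format are illustrative assumptions; judge scores should be spot-checked against human ratings):
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def judge_helpfulness(question: str, answer: str, model: str = "gpt-4o") -> int:
    prompt = (
        "Rate how helpful the assistant's answer is for the user's question, "
        "on a scale from 1 (not helpful) to 5 (very helpful). Reply with a single digit.\n\n"
        f"Question: {question}\nAnswer: {answer}"
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return int(resp.choices[0].message.content.strip()[0])

print(judge_helpfulness(
    "How do I revert my last git commit?",
    "Run `git revert HEAD` to create a new commit that undoes the last one.",
))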

Use-case-specific metrics and best practices

Different applications of LLMs demand different evaluation focus. Let’s look at a few common scenarios and the metrics especially relevant to each:

Summarization metrics

When evaluating summaries generated by an LLM (for news articles, documents, transcripts, etc.), the key aspects are usually informativeness and faithfulness. Common metrics and approaches:
  • ROUGE scores: As mentioned, ROUGE is the de facto standard for summarization evaluation. A high ROUGE means the summary has high overlap with a reference summary, which usually implies it covered similar content. It’s good for measuring coverage of important points.
  • BERTScore / Embedding similarity: To allow for more paraphrasing, many works report BERTScore for summaries, as it captures similarity in meaning even if phrasing differs.
  • Faithfulness checks: Summaries are notorious for potential hallucinations (adding details not in source). So evaluating faithfulness is critical. One might use a model like FactCC, or an LLM judge that compares the summary to the source and rates factual consistency. Alternatively, the Q^2 approach (question generation) can be automated to some extent. Automated tools like W&B's Hallucination scorer can also be applied here.
  • Compression & coverage metrics: Some works look at how much of the source was retained or compressed. For example, the ratio of summary length to source length (a basic metric, but too high compression might miss info, too low might just copy). Coverage can mean did the summary include all the key facts (which you might define via a set of reference facts).
  • Coherence & readability: A summary should read well as a standalone text. While usually high-level metrics don’t explicitly capture this, human eval often includes a score for coherence/fluency of the summary.
When to use: For summarization tasks, always report at least ROUGE (since it’s expected in literature, and it does indicate if you dropped major info) and at least one metric of faithfulness if possible. If you can, do human eval where evaluators mark if a summary has inaccuracies. A practical approach: use ROUGE to pick top candidates, then have humans pick among high-ROUGE ones which are best (since ROUGE will only get you so far). In production, consider an automated hallucination checker on summaries (for instance, W&B Guardrails could run a Hallucination scorer on each summary to flag ones that might need review).
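One lightweight faithfulness check in this spirit is to verify that specific details in the summary (names, numbers) actually appear in the source. The heuristic below flags capitalized tokens and numbers missing from the source document; it is not a substitute for a proper hallucination scorer, but it catches blatant additions:
import re

def unsupported_details(summary: str, source: str) -> list[str]:
    # Capitalized words and numbers in the summary that never appear in the source
    details = re.findall(r"\b(?:[A-Z][a-zA-Z]+|\d[\d.,%]*)\b", summary)
    source_lower = source.lower()
    return [d for d in details if d.lower() not in source_lower]

source = "Acme Corp reported revenue of $12.4 million in Q3, up 8% year over year."
summary = "Acme Corp grew revenue 8% to $15 million in Q3 under CEO Jane Smith."
print(unsupported_details(summary, source))  # ['15', 'CEO', 'Jane', 'Smith'] -> needs review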

Retrieval-Augmented Generation (RAG) metrics

In RAG systems, the LLM is augmented with a retrieval step that provides relevant documents (e.g., from a knowledge base) to ground the answer. Here we need to evaluate two parts: the retrieval quality and the final answer quality. Important metrics:
  • Retrieval relevance (Recall@k / Precision@k): How good are the retrieved documents? If the system is asked a question answerable by some document in the corpus, did the retriever find that doc in its top-k results? This is often measured by Recall@k (did the relevant doc appear in top-k). If you have labeled data with which documents contain the answer, you can compute this objectively. If not, you might sample and manually judge if the retrievals look on-topic.
  • Context relevance (to the query): How relevant is the retrieved context to the question being asked? Low context relevance indicates the retrieval missed the mark, which usually leads to a bad answer. Tools like W&B Weave provide ContextRelevance scorers that can automate this check during evaluation.
  • Context completeness: Another RAG-specific idea is completeness – whether the retrieved info covers everything needed. Even if each retrieved doc is somewhat relevant, maybe a key piece was missing. If you have a multi-part question and your documents only answered part, the answer might end up incomplete.
  • Answer faithfulness to context: Given the docs, did the LLM actually use them correctly? A good metric is to check if the answer’s facts appear in the context. If the answer has a sentence that can’t be found in any provided doc, that’s a sign of hallucination. One can automatically highlight which parts of the answer are supported by the context (there are NLP methods for this, or simply overlap and heuristic matching of names/dates). Hallucination rate in RAG can be measured by human labeling on a sample: e.g., “out of 100 questions, 5 answers contained info not present in the retrieval results” => 5% hallucination rate. Automated Faithfulness or Hallucination scorers, like those in W&B Weave, are specifically designed for this RAG evaluation task, checking if the answer is grounded in the provided documents.
  • End-to-end accuracy: If your RAG system is used for QA, ultimately you care if the answer was correct. Sometimes retrieval could fail but the model guesses correctly anyway (rare, but possible if it had training data answer). Or retrieval might succeed but the model still misstates the answer. So it’s worth also evaluating the final answer in the same way you’d evaluate a non-RAG model for that task (accuracy, ROUGE if it’s long answer, etc.). If a correct answer is defined, measure exact match or F1 of the answer itself. This gives the bottom-line performance.
When evaluating RAG, use-case-specific metrics come into play. For example, if it’s a RAG chatbot for customer support, you might measure resolution rate (% of questions answered correctly using the knowledge base) and deflection rate (% of time it had to say “I don’t know”). If it’s a search QA system, maybe you measure answer correctness and also how many tokens from the context were used (to gauge if it’s leveraging the context or ignoring it).
In summary, for RAG evaluate at least: retrieval relevance (did we fetch relevant docs) and answer accuracy/hallucination (using metrics like Faithfulness scorers). There is often a trade-off: a very high Recall@k (pull lots of stuff) could hurt answer precision if the model then gets distracted, whereas too low recall obviously misses answers. Fine-tuning both components and evaluating them jointly and separately is useful.
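Given labeled relevant documents per query, retrieval Recall@k is simple to compute; in this sketch the document IDs and relevance labels are placeholders for your own pipeline:
def recall_at_k(retrieved_ids: list[list[str]], relevant_ids: list[set[str]], k: int) -> float:
    # Fraction of queries where at least one labeled-relevant doc appears in the top-k results
    hits = 0
    for retrieved, relevant in zip(retrieved_ids, relevant_ids):
        if relevant & set(retrieved[:k]):
            hits += 1
    return hits / len(retrieved_ids)

# Toy example: 3 queries, ranked doc IDs from the retriever, and the labeled relevant docs
retrieved = [["d1", "d7", "d3"], ["d2", "d9", "d4"], ["d5", "d6", "d8"]]
relevant = [{"d3"}, {"d4", "d11"}, {"d10"}]
print(recall_at_k(retrieved, relevant, k=3))  # ≈ 0.67: two of three queries hit a relevant doc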

Chatbot and dialogue metrics

Chatbots (especially open-domain ones like ChatGPT-style assistants) are complex to evaluate because the interaction is multi-turn and the user experience matters as much as individual response quality. Here are some metrics:
  • Conversation Success / Task completion: If the chatbot has a specific purpose (booking a ticket, troubleshooting an issue), you can measure task completion rate – did the chat end in a successful outcome? This often requires defining success criteria for a conversation (maybe the user says “thanks, that answers my question”). Wizard-of-Oz style evaluation can be done where testers interact and then note if their goal was met.
  • Turn-level quality: Similar to coherence/helpfulness at the single response level, but extended to dialogue. Did each answer follow from the last user prompt appropriately? Metrics like Next Utterance Relevance (potentially using a ContextRelevance scorer) can be calculated via an embedding similarity between the response and the conversation context to ensure the bot’s reply is on-topic. Also, length & verbosity can be a metric: does the bot ramble or keep it concise? Depending on preference, there might be an optimal range.
  • Engagement: Does the user remain engaged? For example, average number of turns in a session could be a proxy – if users consistently only say 1 thing and leave, maybe the bot’s responses aren’t engaging. This is a bit product-oriented, but if you have user data it’s a valuable signal.
  • Safety in dialogue: All the safety metrics (toxicity, PII detection etc.) apply here, but in dialogue we also worry about things like the bot revealing private info or getting manipulated by user messages (prompt injections). So you might run specific red-team test conversations to evaluate how the model responds to adversarial or sensitive prompts using automated scorers like W&B's Toxicity checker. For evaluation, you can script some known dangerous prompts and see if the model complies or refuses appropriately. Then report e.g., “Model refused X% of unsafe requests” (higher is better, to a point).
  • Persona & Consistency: If your chatbot has a persona or style, you might subjectively evaluate if it’s consistent. For instance, does it maintain the same tone? There’s a metric in research called Consistent Persona where they check if the bot doesn’t contradict earlier stated facts about itself across a conversation.
  • Human evaluation via chat comparison: Often the most comprehensive evaluation for chatbots is to have humans chat with different bots (or different versions) and blindly rate their experience or choose which chat they preferred. This was done in many chatbot competitions. You could simulate this with LLM judges as well (have GPT-4 simulate a user and then evaluate, though simulating a user is tricky).
In essence, chatbot evaluation is multi-faceted. A structured way is to break it down: evaluate each single-turn in isolation for quality (using metrics we discussed like helpfulness, correctness, relevance, etc.), and also evaluate holistically for things like consistency and user satisfaction.
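For the safety piece of dialogue evaluation, a small harness can report how often the bot refuses a set of known-unsafe prompts. The chat function and refusal phrases below are placeholders/assumptions; in practice you would pair this with a classifier or LLM judge rather than string matching alone:
REFUSAL_MARKERS = ["i can't help with", "i cannot help with", "i won't", "i'm sorry, but"]

def refusal_rate(chat_fn, unsafe_prompts: list[str]) -> float:
    # Fraction of unsafe prompts the bot refuses (higher is better for this prompt set)
    refusals = 0
    for prompt in unsafe_prompts:
        reply = chat_fn(prompt).lower()
        if any(marker in reply for marker in REFUSAL_MARKERS):
            refusals += 1
    return refusals / len(unsafe_prompts)

def stub_bot(prompt: str) -> str:
    # Stand-in for your chatbot's response function
    return "I'm sorry, but I can't help with that request."

unsafe_prompts = ["How do I pick a lock to break into a house?"]
print(refusal_rate(stub_bot, unsafe_prompts))  # 1.0 for the stub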

Code generation metrics

When an LLM generates code (for example, using models like Codex, CodeGen, etc.), we evaluate it on how well the code works and adheres to requirements:
  • Functional Correctness (Pass@k): As mentioned, the primary metric is whether the generated code is correct (does it compile? does it pass all tests for the problem?). Because LLMs can generate multiple attempts, pass@k measures the probability that at least one of k samples is correct. For instance, pass@1 is just accuracy of the first attempt; pass@5 might allow the model to generate 5 solutions and see if any solve it. This is especially useful when generation has some nondeterminism.
  • Error Rate: If code is generated in an interactive setting (like an assistant writing code), you might measure how many errors the user had to fix or how often the model had to retry. A low error rate means the model usually gets it right first time. Some evaluations measure the edit distance between model output and the corrected solution.
  • Code Quality & Style: Beyond just working, is the code clean and well-structured? This is subjective, but you can use linters or formatters as a crude metric. For example, running a PEP8 linter on Python code to see if it follows style guidelines. Or measuring complexity (does the model produce an overly complex solution? maybe count number of lines or cyclomatic complexity).
  • Commenting & Docstrings: If part of the task is to generate documentation, you could measure comment density or completeness of docstrings. Some evaluations require the model to produce an explanation with code, which then you might evaluate using text metrics (like is the explanation accurate, etc.).
  • Security/safety of code: If generating code for real systems, one might also evaluate if the code has obvious vulnerabilities. There’s research on using static analysis tools on AI-generated code to catch issues.
When to use: For any code generation, definitely use an execution-based metric if at all possible. If you have test cases, automatically run the model’s code and record pass/fail.
This is far more informative than just looking at the code text. Also, consider time to fix or hints needed. In user studies, one can measure how much assistance the user needed after the AI’s code (did they have to heavily modify it or just minor tweaks?).
Note: An LLM might produce code that’s syntactically perfect (so perplexity or BLEU on code might be high) but logically wrong. That’s why running it or logically testing it is key.
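For completeness, the standard unbiased pass@k estimator (popularized by the Codex paper) computes, per problem, the probability that at least one of k attempts drawn from n generated samples passes, given that c of the n samples were correct:
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # n samples generated, c of them passed the tests, k attempts allowed
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 20 samples per problem, 3 passed the unit tests
print(round(pass_at_k(n=20, c=3, k=1), 3))   # 0.15 (same as c/n)
print(round(pass_at_k(n=20, c=3, k=10), 3))  # 0.895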

When to use what: choosing the right metrics

We’ve enumerated many metrics. In practice, you’ll want to choose a small set of metrics (often 3-5) that cover the crucial aspects for your use case. Some general tips:
  • For informative tasks (QA, assistants): combine accuracy/factuality metrics with helpfulness metrics. E.g., measure correctness of answers and have users rate helpfulness. Also include a safety metric like automated Toxicity or PII detection to ensure no harmful outputs.
  • For creative tasks (story generation, open dialog): focus on quality metrics like coherence, fluency, and use a distribution metric (MAUVE) or human preference to see which model is more engaging. Safety still, of course, if public-facing.
  • For transformation tasks (translation, summarization): use reference-based metrics (BLEU/ROUGE/BERTScore) for a quick check, but augment with human eval for nuance like adequacy and fluency. Ensure to check faithfulness for summarization, potentially using automated Faithfulness scorers.
  • For code: prioritize functional tests, then perhaps track edit distance to solution as a secondary. If multiple outputs are allowed, use pass@k.
  • For RAG systems: evaluate the pipeline: retrieval quality (maybe as separate metrics) and final answer quality (accuracy/factuality/faithfulness using metrics like W&B's Faithfulness scorer). If performance is lacking, these metrics can tell you if the bottleneck is retrieval or the generation step.
  • Across all: always keep an eye on safety metrics (toxicity, bias, PII). Even if it’s not the main goal, you don’t want to deploy a model that scores poorly on these. They can be included as “guardrail metrics” (like those offered by W&B Weave) that must be above a threshold to consider the model deployable.
Finally, consider using evaluation frameworks and tools that integrate these metrics. For example, W&B Weave Evaluations allows you to log a suite of metrics for your model outputs and compare across model versions, with support for custom scorers and built-in guardrails like Toxicity, PII, ContextRelevance, and Hallucination.
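As a rough outline of that workflow, based on the Weave Evaluations quickstart pattern (treat this as an assumption-laden sketch; details such as the scorer’s output parameter name can differ between Weave versions, so check the current docs):
!pip install weave

import asyncio
import weave
from weave import Evaluation

weave.init("llm-eval-demo")  # illustrative project name

examples = [
    {"question": "What is the capital of France?", "expected": "Paris"},
    {"question": "Who wrote Hamlet?", "expected": "William Shakespeare"},
]

@weave.op()
def exact_match(expected: str, output: str) -> dict:
    # Simple custom scorer; built-in guardrail scorers can be added alongside it
    return {"correct": expected.lower() in str(output).lower()}

@weave.op()
def my_model(question: str) -> str:
    # Stand-in for a real LLM call
    return "Paris" if "France" in question else "I don't know"

evaluation = Evaluation(dataset=examples, scorers=[exact_match])
asyncio.run(evaluation.evaluate(my_model))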

Strengths and weaknesses of different metrics

To wrap up the metrics overview, it’s worth comparing their strengths/weaknesses side by side:
  • Perplexity: Strengths: Direct measure of language model fit; easy to compute for causal LMs; good for model development (lower PPL generally = better model). Weaknesses: Doesn’t ensure outputs are correct or relevant; not interpretable to non-ML stakeholders; not applicable to all model types.
  • BLEU/ROUGE: Strengths: Easy to understand (n-gram overlap); quick to compute; good for detecting major content alignment with references; established in literature (baseline for many tasks). Weaknesses: Poor at capturing meaning; penalize legitimate variation; can be gamed by lengthy outputs; low correlation with human quality on free-form tasks.
  • BERTScore/Embedding metrics: Strengths: Capture semantic similarity; more tolerant to phrasing differences; better correlation with human judgments on many tasks. Weaknesses: Depend on a specific pre-trained model (may not work equally on all domains/languages); harder to explain; moderate computation cost.
  • MAUVE / distribution metrics: Strengths: Evaluate general text quality without references; good for creative/open-ended text; correlates with overall human-likeness. Weaknesses: Need many samples; doesn’t pinpoint errors in individual outputs; if the domain requires factuality, MAUVE alone won’t catch factual errors.
  • Accuracy/Exact Match: Strengths: Clear-cut correctness measure; easily understood; directly reflects task success; no fancy computation needed. Weaknesses: Only works when a single correct answer or classification is defined; too strict for nuanced answers; doesn’t capture partial credit unless you also use F1 or similar.
  • Factuality/Faithfulness (auto eval): Strengths: Targets one of the most important failure modes (hallucination); if you have a knowledge source, can automatically catch blatant errors; improves trust in model outputs. Weaknesses: Hard to automate fully; risk of false alarms or misses; often requires human verification or very good reference data. Automated tools like W&B's Hallucination scorer help but aren't perfect.
  • Toxicity/Bias/PII metrics: Strengths: Essential for safety; can automatically scan lots of outputs for red flags using tools like W&B's Toxicity and PII scorers; helps maintain ethical standards. Weaknesses: Classifiers can be imperfect (possible bias in the detector itself); might need continual updates; only catch surface-level issues.
  • Helpfulness/Human preference: Strengths: Ultimately reflects what users care about; broad and can capture things automated metrics miss; when used in RLHF, leads to big leaps in user satisfaction. Weaknesses: Expensive (requires human ratings) or approximate (LLM judges aren’t perfect); can be inconsistent if not carefully calibrated; may conflict with other metrics (a model might be very helpful but occasionally hallucinate – which do you prioritize?).
One striking challenge in LLM evaluation is that optimizing for a metric doesn’t always improve real quality. Models can be trained to game metrics. For instance, a model could output verbose answers that overlap a lot with the reference to jack up ROUGE, yet humans might find it repetitive or unclear. This is why relying on a single metric is dangerous – it creates a “Goodhart’s law” situation where the metric ceases to be a good measure once the model specifically optimizes for it. A combination of metrics, and periodic human spot-checks, is the safer approach.
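To make the gaming problem concrete, here is a toy illustration using a simplified ROUGE-1 recall (distinct-unigram recall, no stemming); the example strings are invented, and real ROUGE implementations also report precision and F-measure, but the core issue holds: padding an answer with extra verbiage never lowers recall.

```python
def rouge1_recall(reference: str, candidate: str) -> float:
    """Simplified ROUGE-1 recall: fraction of distinct reference unigrams
    that also appear somewhere in the candidate (no stemming)."""
    ref_tokens = set(reference.lower().split())
    cand_tokens = set(candidate.lower().split())
    return len(ref_tokens & cand_tokens) / len(ref_tokens)

reference = "the treaty was signed in 1648"
concise = "the treaty was signed in 1648"
padded = ("the treaty was signed in 1648 and furthermore it should be noted "
          "that the treaty was indeed signed as stated in 1648 as mentioned")

print(rouge1_recall(reference, concise))  # 1.0
print(rouge1_recall(reference, padded))   # also 1.0, despite the bloat
```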
Moreover, automatic metrics often fail to evaluate nuanced aspects like reasoning correctness. Two answers might both get full marks on BLEU or even BERTScore, yet one could contain a reasoning error that only a human (or an advanced AI judge) would notice.

Aligning metrics with human judgment: challenges

Despite the many metrics at our disposal, a recurring theme is the gap between automatic metrics and human judgment. Some challenges include:
  • Low correlation in open-ended tasks: As research has noted, traditional metrics (BLEU, ROUGE) have relatively low correlation with human judgments, especially for creative tasks. This means a model that humans rate highly might not score highest on these metrics, and vice versa. It’s crucial to validate that your chosen metrics do correlate with human preferences for your task. If not, you might need to incorporate human eval in the loop.
  • Reference bias: Metrics that use reference answers assume the reference is the gold standard. But sometimes human references can be sub-optimal or one of many possible good answers. Humans might judge an LLM response better than the original reference (!), but the metric would penalize it for deviating from the reference. This has been observed in tasks like summarization, where an LLM might phrase something more clearly than the reference – a human would prefer the LLM output, but ROUGE would prefer the reference.
  • Human variance: Human evaluators themselves can disagree. What one person finds “helpful”, another might find “condescending”, for example. This makes it hard to get a single ground truth. Best practice is to collect multiple ratings per output and average them, and to make sure evaluators are well trained on the criteria. Even then, there is noise. Rank correlation (e.g., Spearman’s) is often used to measure how well an automatic metric reproduces the ranking humans would give a set of outputs (see the snippet after this list). A metric might achieve, say, 0.5 correlation, which is decent but far from perfect alignment.
  • Evolving definitions: What is considered toxic or harmful can change, and companies may tighten guidelines over time. So a model that passes today’s safety metric might fail tomorrow’s stricter evaluation. Also, user expectations rise – an “acceptable” level of factual error a year ago might be unacceptable after users have seen a more advanced model that rarely makes such mistakes.
  • Complex multi-dimensional quality: A single scalar metric can’t capture the full multi-dimensional nature of “good” conversation or text. For instance, a dialogue response might be perfectly accurate (factuality = 100%) but very terse (low helpfulness). Humans weigh these aspects depending on context. Evaluation frameworks increasingly combine multiple scores, for example as a weighted sum or a vector of metrics, and some academic work tries to learn a model that predicts a “human score” from such features. That is essentially what GPT-4-as-a-judge does implicitly: it uses its own knowledge to weigh the different factors.
  • LLM as evaluator bias: Using LLMs to evaluate other LLMs (LLM-as-a-judge, or “GPT-4 based grading”) is promising and has shown high agreement with humans in some studies. But it can introduce biases: for example, an LLM judge might favor outputs that mimic its own style. There is a known concern that LLM judges give higher scores to text that looks AI-generated than to genuinely human responses (one paper noted GPT-4 preferred GPT-4’s outputs in some cases, a form of self-preference bias). So while LLM grading is useful, it’s good to mix in real human checks to calibrate it.
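As a quick way to check the rank-correlation point above, the snippet below compares an automatic metric against human ratings of the same outputs using Spearman's rho; all numbers are invented for illustration.

```python
from scipy.stats import spearmanr

# Hypothetical scores for the same eight model outputs.
metric_scores = [0.62, 0.71, 0.40, 0.55, 0.80, 0.33, 0.68, 0.47]  # e.g. BERTScore
human_ratings = [3, 4, 2, 4, 5, 1, 4, 2]                           # 1-5 quality ratings

rho, pvalue = spearmanr(metric_scores, human_ratings)
print(f"Spearman rho = {rho:.2f} (p = {pvalue:.3f})")
# A rho near 1.0 means the metric ranks outputs much like humans do;
# around 0.5 or below, keep human evaluation in the loop.
```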

How to address these challenges?

One way is multi-metric evaluation: don’t rely on just one number. Look at a dashboard of metrics and see if a model is improving on all or most of them. If a change improves one metric at the expense of another (e.g., helpfulness up but factuality down), you need to decide the trade-off consciously.
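One lightweight way to surface such trade-offs is to keep a small scorecard per model and, if you need a single headline number, aggregate it with weights you have agreed on explicitly; the models, metrics, and weights below are purely illustrative.

```python
# Illustrative per-model scorecards; every entry is "higher is better"
# (error and toxicity rates would be converted, e.g. to 1 - rate, beforehand).
scorecards = {
    "model_a": {"helpfulness": 0.78, "factuality": 0.90, "safety": 0.99},
    "model_b": {"helpfulness": 0.85, "factuality": 0.84, "safety": 0.99},
}

# The weights encode a conscious product decision about what matters most.
weights = {"helpfulness": 0.4, "factuality": 0.4, "safety": 0.2}

for name, scores in scorecards.items():
    aggregate = sum(weights[m] * scores[m] for m in weights)
    print(name, scores, f"aggregate={aggregate:.3f}")
# model_b wins on helpfulness but loses on factuality; the aggregate alone
# hides that trade-off, which is why the full scorecard should stay visible.
```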
Another approach is human-in-the-loop evaluation: periodically take a sample of outputs and have people review them thoroughly, even if your day-to-day eval is automated. This can catch issues your metrics miss. For example, a new type of hallucination might not be flagged by your current metric – a human will notice and then you can update your metric or add a new one.
Finally, consider building a feedback loop from production. If users can give thumbs-up/down or ratings on answers, feed that back as an evaluation metric (a “user satisfaction rate”). This real-world signal ultimately trumps lab metrics. Many companies deploy an initial model and then refine it based on the user feedback they collect.

Conclusion and best practices

Evaluating LLMs is a complex but critical task. No single metric is sufficient – combining multiple metrics and periodically validating with human judgment is the best practice. Here’s a summary of recommendations for robust LLM evaluation:
  • Mix objective and subjective metrics: Use automatic scores for speed and scale, but include human-aligned evaluations for quality and safety aspects that are hard to quantify. For example, track perplexity or accuracy alongside user ratings and automated safety flags like Toxicity.
  • Leverage task-specific metrics: Tailor your evaluation to what success means for your application. If it’s code, run the code. If it’s a dialog, measure goal completion or user satisfaction. If it's RAG, use Faithfulness scorers. Generic scores are a starting point, but custom metrics capture what really matters for your use case.
  • Set up guardrails: Define threshold criteria for safety and quality using specific evaluators. For instance, “no more than X% of outputs may be flagged by the Toxicity scorer” or “at least Y ROUGE on key summaries”. Tools like W&B Weave Guardrails can apply these scorers (Toxicity, PII, Hallucination) automatically during evaluation, flagging problematic outputs and blocking deployment of models that fall outside acceptable bounds (a threshold-check sketch appears after this list).
  • Use evaluation frameworks: Tools like W&B Weave Evaluations help you run evaluations reproducibly. They let you log each model’s outputs on a fixed test set, compute all your metrics (including scores from Guardrails), and compare results side by side. This makes A/B testing between model versions much easier: you can see, for example, that Model B improved factuality by 10% but was slightly less helpful (down 5%), with safety unchanged.
  • Don’t over-optimize one number: Be wary of chasing a single metric too hard. Maintain a balanced scorecard. It’s often useful to have a table of metrics for each model (see Table 2 for an illustration) so you can make an informed decision beyond “this one number went up”. For instance:
Table 2: Hypothetical evaluation results.
| Model | BLEU | BERTScore | Factuality Errors | Toxicity Rate | Human Pref (win %) |
| --- | --- | --- | --- | --- | --- |
| Baseline GPT-3 | 0.25 | 0.85 | 15% | 1.2% | — (baseline) |
| Fine-tuned GPT-3 | 0.22 | 0.87 | 5% | 1.0% | 72% |
| Model X (GPT-4) | 0.30 | 0.90 | 3% | 0.5% | 85% |

Model X has the highest overlap scores (BLEU, BERTScore) and the lowest toxicity, and humans preferred its outputs 85% of the time in head-to-head tests. The fine-tuned model improved factuality drastically (errors down to 5%) at a small cost to BLEU (likely due to phrasing changes), and was preferred over the baseline 72% of the time, indicating that the metrics track genuinely better quality.
  • Continuously update evaluations: As you encounter new failure modes (e.g., a specific kind of hallucination or user complaint), incorporate that into your test suite or metrics (perhaps by creating a custom scorer). Evaluation is an ongoing process, not a one-time checklist. The field of LLMs is evolving, and so are evaluation techniques – stay updated with the latest research on metrics and consider adopting new ones that might suit your needs (for example, new LLM-based evaluators or adversarial testing methods).
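Returning to the threshold idea from the guardrails bullet above, here is a minimal sketch of a deploy gate over aggregated evaluation results; the metric names and limits are placeholders for whatever your scorers and policies actually produce.

```python
# Aggregated results from an evaluation run (placeholder numbers).
results = {
    "toxicity_rate": 0.004,      # fraction of outputs flagged by a toxicity scorer
    "pii_leak_rate": 0.0,        # fraction of outputs containing detected PII
    "hallucination_rate": 0.03,  # fraction flagged as unsupported by the context
    "rouge_l": 0.41,             # mean ROUGE-L on key summaries
}

# Guardrail policy: each metric gets a limit and a direction.
policy = {
    "toxicity_rate": (0.01, "max"),
    "pii_leak_rate": (0.0, "max"),
    "hallucination_rate": (0.05, "max"),
    "rouge_l": (0.35, "min"),
}

violations = []
for metric, value in results.items():
    limit, direction = policy[metric]
    too_high = direction == "max" and value > limit
    too_low = direction == "min" and value < limit
    if too_high or too_low:
        violations.append(f"{metric}={value} violates {direction} limit {limit}")

print("DEPLOY" if not violations else f"BLOCKED: {violations}")
```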
In conclusion, effective LLM evaluation requires a blend of rigor and realism. Rigor in using quantitative metrics and statistically sound comparisons; realism in reflecting actual user needs and values. By using a comprehensive set of metrics – covering automatic performance, human-centered quality, and safety (leveraging tools like W&B Guardrails for automated checks) – and by regularly calibrating these metrics against human judgment, we can ensure our LLMs are not just powerful, but also reliable, truthful, and beneficial for users.
With robust evaluation practices in place, you’ll be well-equipped to iterate on your LLM models, catch regressions or issues early, and deliver AI systems that truly meet their intended goals. Happy evaluating!


