
AI guardrails: Robustness scorers

Robustness evaluates how consistently large language models perform under noisy or perturbed inputs, using statistical metrics like Cohen’s d to quantify their reliability and adaptability in real-world applications.
Robustness is a key measure of an AI system’s reliability, particularly for large language models (LLMs), which generate outputs probabilistically rather than deterministically. This means that even when given the same input, an LLM’s response may vary. More critically, small changes to the input - such as typos, rewording, or punctuation differences - can sometimes lead to unexpectedly large shifts in output.
For instance, an LLM might correctly answer a straightforward question but fail to do so when the same question is rephrased slightly or includes a typographical error. Such inconsistencies can undermine trust in these systems, particularly in high-stakes scenarios like healthcare, legal applications, or safety-critical systems. Robustness acts as a guardrail, ensuring that models produce stable and reliable outputs even when faced with input variability. By quantifying robustness, we gain insights into a system’s limitations and ensure it is reliable enough for deployment in real-world environments.
Prefer to get hands-on right away? Explore our interactive Colab to start evaluating robustness now, or jump straight to the tutorial below.
For those who want a deeper understanding, we’ll break down the principles behind robustness, explain key evaluation metrics, and provide the tools and code needed to integrate these strategies into your AI projects.


What is robustness?

Robustness measures how reliably a large language model (LLM) maintains consistent outputs when inputs are slightly altered. These variations - such as typos, rephrasings, and formatting changes - can cause disproportionate differences in model responses, affecting their reliability in real-world applications.
For example, an LLM may correctly answer a straightforward question but fail when the same question is reworded. In high-stakes applications like healthcare or legal systems, inconsistent outputs can lead to confusion or serious consequences. Robustness ensures that an AI system produces stable, predictable responses even when input conditions change.

Why does robustness matter?

AI models must handle unpredictable user inputs effectively. If an AI system produces different responses to semantically equivalent queries, users may lose confidence in its reliability. Robustness provides a critical guardrail, preventing erratic behavior and ensuring consistent performance across varying inputs.
  • In customer service, an AI assistant must provide consistent answers regardless of how users phrase their questions.
  • In healthcare, a slight formatting change in a patient’s symptoms should not alter a diagnosis.
  • In legal applications, document analysis should remain stable even when text formatting varies.
By prioritizing robustness, organizations can ensure their AI models deliver consistent, accurate, and trustworthy outputs across a range of real-world conditions. Guardrails like robustness metrics and structured evaluations help prevent unpredictable failures and ensure models meet reliability standards.

Robustness metrics and how to calculate them

Robustness is measured by comparing a model’s performance on original versus perturbed inputs. Statistical effect size metrics—such as Cohen’s h and Cohen’s d—quantify how much outputs change due to small variations in input. These variations can include typos, rewording, punctuation changes, or formatting differences.

The role of ground truth in robustness evaluation

Having ground truth labels is important for ensuring that robustness translates into accuracy. A system might produce stable responses across perturbed inputs, but if those responses are incorrect, consistency alone is not meaningful.
  • With ground truth labels: Robustness can be measured by comparing model outputs against a correct reference.
  • Without ground truth labels: The output from the original input is treated as a proxy ground truth, and perturbed outputs are compared against it.
By using ground truth as a guardrail, Weave ensures that robustness evaluations focus on both consistency and correctness, helping prevent models from reinforcing errors.

Why Performance Drop Rate (PDR) is not used in Weave

A common way to measure robustness is Performance Drop Rate (PDR), which calculates the fractional decrease in output quality for perturbed inputs. The formula for PDR is:
\text{PDR} = \left( \frac{\text{score}_o - \text{mean\_score}_p}{\text{score}_o} \right) \times 100

where:
  • score_o = 1.0 is the quality of the original generation, and
  • mean_score_p is the mean quality of the outputs generated from perturbed inputs, measured against the original generation as the ground truth.
If the ground truth is present and the LLM system outputs binary values, score_o ∈ {0, 1} (i.e., 0 or 1) while mean_score_p ∈ [0, 1].
But PDR is limited in a few different scenarios:
  • PDR is inherently asymmetric in its inputs. For example, if the model's score drops from 0.8 to 0.4, the fractional decrease is (0.8 - 0.4) / 0.8 = 0.5, a 50% performance drop. But if the score increases from 0.4 to 0.8, the fractional increase is (0.8 - 0.4) / 0.4 = 1.0, a 100% "performance improvement".
  • When score_o = 0, PDR is undefined (division by zero). This makes PDR unhelpful when a ground truth reference is available: in a binary output setting, score_o can legitimately be zero.
  • For non-adversarial perturbations, mean_score_p can exceed score_o, i.e., performance can improve. This yields a negative PDR value, which is not necessarily bad, but we would prefer a score bounded between 0 and 1.
For these reasons, the Weave library does not use PDR as a robustness metric; instead, it uses Cohen's h and Cohen's d.
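To make these limitations concrete, here is a minimal sketch in plain Python (not part of the Weave library) that reproduces the asymmetry and division-by-zero issues described above:

def pdr(score_o: float, mean_score_p: float) -> float:
    # Performance Drop Rate as a percentage; undefined when score_o == 0.
    if score_o == 0:
        raise ZeroDivisionError("PDR is undefined when the original score is 0")
    return (score_o - mean_score_p) / score_o * 100

# Asymmetry: a drop from 0.8 to 0.4 reads as a 50% drop,
# while the reverse change reads as a 100% "improvement".
print(pdr(0.8, 0.4))  # 50.0
print(pdr(0.4, 0.8))  # -100.0

# Improvement under perturbation gives a negative PDR rather than a value in [0, 1].
print(pdr(0.6, 0.9))  # -50.0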

Cohen's h (for binary outputs)

The paper, A Novel Metric for Measuring the Robustness of Large Language Models in Non-adversarial Scenarios, was used as a reference for implementing Cohen's h. Cohen's h is a widely used statistical effect size metric that quantifies the difference between two proportions. It is particularly suitable for tasks with binary evaluation outcomes. It is defined as,
\mathrm{H}(\text{score}_i^o, \text{score}_i^p) = \psi(\text{score}_i^p) - \psi(\text{score}_i^o), \quad \text{where } \psi(\text{score}_i) = 2 \arcsin\left(\sqrt{\text{score}_i}\right)

where
  • score_o is the quality of the original generation, and
  • score_p is the mean quality of the perturbed generations
The arcsine transformation stabilizes the variance across proportions and ensures that changes near the extremes, such as 0 or 1, are appropriately emphasized. Additionally, it ensures the metric reflects the statistical detectability of differences rather than simply highlighting raw changes.
For example, consider H(1.0, 0.8) ≈ 0.295 and H(0.8, 0.6) ≈ 0.141. While both represent a 0.2 difference, the drop from 1.0 to 0.8 is more statistically significant, as proportions closer to 1 represent more confident predictions. Cohen's h is particularly suited for measuring robustness in systems evaluated on binary outcomes (e.g., exact match scoring, discrete cases). Unlike PDR (Performance Drop Rate), Cohen's h is symmetric and provides statistical interpretability.
The arcsine function ensures that the range of Cohen's h is bounded by [-π, π]. This makes normalization (H̃ = H / π) straightforward and interpretable, mapping the metric to the range [-1, 1]. We report the magnitude |H̃|, which lies in [0, 1].
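As a quick sanity check, here is a minimal sketch of the normalized, unsigned Cohen's h described above, written in plain Python with only the standard library (independent of the Weave implementation):

import math

def cohens_h_normalized(score_o: float, score_p: float) -> float:
    # Arcsine-transform both proportions, take the difference, and normalize by pi.
    psi = lambda s: 2 * math.asin(math.sqrt(s))
    return abs(psi(score_p) - psi(score_o)) / math.pi

print(round(cohens_h_normalized(1.0, 0.8), 3))  # 0.295
print(round(cohens_h_normalized(0.8, 0.6), 3))  # 0.141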
If your LLM system's output is binary, use Cohen's h to quantify the robustness of your system by configuring the scorer as follows:
robustness_scorer = RobustnessScorer(use_exact_match=True)
This is an effect size metric (not a correlation metric), which provides a quantitative measure of the magnitude of a difference or deviation. However, the raw numerical values can lack intuitive meaning, so a rule-of-thumb interpretation is helpful. Here is a chart showing the mapping from Cohen's h values to a practical interpretation string:
Cohen's h range           Interpretation
0.0 ≤ h < 0.1592          Small effect
0.1592 ≤ h < 0.3820       Medium effect
0.3820 ≤ h ≤ 1.0          Huge effect

By default, the scorer returns only numeric values. If you also want the rule-of-thumb interpretations, pass True to the return_interpretation argument of the RobustnessScorer:
robustness_scorer = RobustnessScorer(
    use_exact_match=True,
    return_interpretation=True
)
Let's illustrate the effectiveness of this metric with one more example:
Original accuracies (score_o) = {1.0, 0.8, 0.8}
Mean perturbed accuracies (score_p) = {0.8, 0.6, 0.36}
Using Cohen's h:
H(1.0, 0.8) = 0.295
H(0.8, 0.6) = 0.141
H(0.8, 0.36) = 0.295
The drop from 1.0 to 0.8 is as significant as the drop from 0.8 to 0.36, despite the latter representing a larger raw decrease. Cohen's h thus accounts for the statistical significance of proportions near the extremes (close to 0 or 1). Another advantage of this metric is its independence from the number of perturbed samples: it remains interpretable even with a single perturbed generation.
However, this metric has one key limitation: it only works for binary outcomes, whereas some systems, such as free-form text generation, are better evaluated with continuous "quality" scores like semantic similarity.

Cohen's d (for free-form text outputs)

So far we have discussed Cohen's h using exact string matching as the quality measure, which converts the outputs to booleans (by specifying use_exact_match=True). However, for most of today's LLM evaluations, relying on an exact match to measure correctness is not feasible, as there can be many correct answers to a single query.
We can instead use Semantic Textual Similarity, the task of evaluating how similar two texts are in meaning. In the Weave implementation, the all-MiniLM-L6-v2 embedding model is used by default, but you can also specify an embedding model of your choice. We use cosine similarity to produce a similarity score between the predicted and ground truth text.
If ground truth is not present, score_o is 1.0, because the cosine similarity of a text with itself is 1.0. In this case, mean_score_p is the average of the cosine similarities between the perturbed generations and the original generation, and can take any value in [0, 1].
If ground truth is present, score_o is the cosine similarity between the original generation and the ground truth, and mean_score_p is the average of the cosine similarities between the perturbed generations and the ground truth. Again, both values lie in [0, 1].
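The internals of the Weave scorer are not reproduced here, but a minimal sketch of this scoring scheme, assuming the sentence-transformers library and the default all-MiniLM-L6-v2 model, might look like this (the function name and signature are illustrative, not Weave's API):

from typing import Optional
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def similarity_scores(original: str, perturbed: list[str], ground_truth: Optional[str] = None):
    # Without ground truth, the original generation serves as the reference (so score_o = 1.0).
    reference = ground_truth if ground_truth is not None else original
    ref_emb = model.encode(reference, convert_to_tensor=True)
    score_o = util.cos_sim(model.encode(original, convert_to_tensor=True), ref_emb).item()
    perturbed_embs = model.encode(perturbed, convert_to_tensor=True)
    mean_score_p = util.cos_sim(perturbed_embs, ref_emb).mean().item()
    return score_o, mean_score_p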
Since the scores are continuous, we cannot use Cohen's h; we use Cohen's d instead. Cohen's d is used automatically by the scorer when use_exact_match=False. Unlike Cohen's h, which is nicely bounded in [0, 1] after normalization, Cohen's d is unbounded and can range from negative to positive infinity, because it measures the standardized difference between two means in units of standard deviations.
In our context, we have a single original score and multiple perturbed scores. This situation resembles a paired sample scenario, where each perturbed score is directly related to the original score. Here is the equation for the Cohen's d metric:
d = \frac{\bar{d}}{s_d}

where \bar{d} is the mean of the differences between paired observations and s_d is the standard deviation of these differences.
Cohen's d is theoretically unbounded, but in most practical cases it falls within [-3, 3]. For simplicity, we return only the absolute value, so the practical range becomes [0, 3]. The sign still carries useful information: a positive value means the perturbations are lowering the similarity scores, while a negative value (rare in practice) means they are improving similarity. To preserve this information, the Weave scorer also returns a string interpretation of the direction, either "positive" or "negative".
Since we divide by the standard deviation, the computed Cohen's d can be extremely large (> 3) when the standard deviation is close to zero but the mean difference is not. For example, if \bar{d} is 0.01 and s_d is 0.0001, then d = 0.01 / 0.0001 = 100, which is misleading. In practice this only happens when the number of perturbed generations is very small (2-3); if you notice this effect, consider increasing the number of perturbed examples.
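Here is a minimal sketch of this paired-sample calculation in plain Python (independent of the Weave implementation), including an example of the near-zero standard deviation caveat:

import statistics

def paired_cohens_d(score_o: float, perturbed_scores: list[float]) -> float:
    # Unsigned paired Cohen's d between one original score and several perturbed scores.
    diffs = [score_o - s for s in perturbed_scores]
    d_bar = statistics.mean(diffs)
    s_d = statistics.stdev(diffs)  # sample standard deviation; needs at least 2 perturbed scores
    return abs(d_bar / s_d)

print(round(paired_cohens_d(0.95, [0.90, 0.85, 0.88, 0.92]), 2))  # ~2.09
# Tiny spread with a nonzero mean difference inflates d, as discussed above:
print(round(paired_cohens_d(0.95, [0.9399, 0.9401]), 1))  # ~70.7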
Furthermore, remember that Cohen's d should be interpreted within the practical context of your use case. That said, a rule-of-thumb interpretation is returned if you set return_interpretation=True. Here is the chart mapping Cohen's d values to a practical interpretation string:
Cohen's d range           Interpretation
0.0 ≤ d < 0.2             Small effect
0.2 ≤ d < 0.8             Medium effect
0.8 ≤ d                   Huge effect
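If you keep return_interpretation off and want to map scores to labels yourself, a hypothetical helper mirroring the two rule-of-thumb tables above could look like this (the thresholds come straight from the tables; the function itself is not part of Weave):

def interpret_effect_size(value: float, metric: str = "d") -> str:
    # Map an unsigned effect size to the rule-of-thumb labels used in the tables above.
    small_cutoff, medium_cutoff = (0.1592, 0.3820) if metric == "h" else (0.2, 0.8)
    if value < small_cutoff:
        return "Small effect"
    if value < medium_cutoff:
        return "Medium effect"
    return "Huge effect"

print(interpret_effect_size(0.295, metric="h"))  # Medium effect
print(interpret_effect_size(2.09))               # Huge effect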


Why you need robustness

Robustness is essential for ensuring that AI systems perform reliably in real-world scenarios, where input variability is inevitable. Users may introduce typos, rephrasings, adversarial inputs, or unexpected formatting changes, and without robustness, models can fail unpredictably - leading to poor user experiences or even critical failures in high-stakes applications like healthcare, legal automation, or financial analysis.
By quantifying robustness, we gain a systematic way to evaluate and compare models. Guardrails like structured robustness metrics help prevent unpredictable behavior, ensuring models remain resilient to input variations.
Beyond evaluation, robustness metrics provide actionable insights for improvement. If a model struggles with specific perturbations, it can be fine-tuned with more diverse training data or adjusted using guardrails to handle edge cases. However, robustness alone is not enough—models must also maintain accuracy. A system that consistently produces incorrect responses under perturbation may appear robust, but such stability is meaningless without relevance and correctness.
Ultimately, robustness is a measure of trustworthiness. It serves as a guardrail that ensures AI models perform well not just under ideal conditions, but also when faced with real-world input variations. This combination of consistency and quality is critical for deploying AI in applications where reliability is paramount.

Evaluating robustness with Weave scorers

This tutorial illustrates the implementation of robustness metrics to evaluate large language models using Weave. The goal is to quantify how reliably models perform when faced with input variability, such as typos, paraphrasing, or formatting changes. By using structured scoring methods as guardrails, Weave ensures that robustness evaluations are both meaningful and reproducible.
The dataset used for this evaluation is a subset of the AYA Evaluation Suite, a benchmark that contains 26,750 open-ended, conversation-style prompts designed to evaluate multilingual open-ended generation quality. For this tutorial, we focus on the AYA Human-Annotated subset, filtered to include only English-language examples (eng), and sample 50 examples from the dataset.
To simulate real-world variability and test model robustness, the dataset is augmented with perturbations generated programmatically, using techniques such as typos, paraphrasing, punctuation changes, and spacing errors. Specifically, the create_perturbed_dataset function from weave.scorers introduces controlled variations to the original inputs, creating multiple perturbed versions of each sample.
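For intuition, here is a rough, hand-rolled sketch of the kinds of perturbations involved; the actual create_perturbed_dataset helper in weave.scorers is more systematic, and the functions below are purely illustrative:

import random

def add_typo(text: str) -> str:
    # Swap two adjacent characters at a random position.
    if len(text) < 2:
        return text
    i = random.randrange(len(text) - 1)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]

def mangle_spacing(text: str) -> str:
    # Duplicate a random space to simulate spacing errors.
    spaces = [i for i, c in enumerate(text) if c == " "]
    if not spaces:
        return text
    i = random.choice(spaces)
    return text[:i] + " " + text[i:]

original = "What is the capital of France?"
perturbed = [add_typo(original), mangle_spacing(original), original.rstrip("?")]
print(perturbed)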
To start, run the following install command:
git clone https://github.com/wandb/weave.git && cd weave && git fetch origin pull/3006/head:xtra-scorers && git checkout xtra-scorers && pip install -qq -e .
Also, export your OpenAI key:
export OPENAI_API_KEY='your api key'
Next, we will write an evaluation script that compares GPT-4o and GPT-4o Mini on our dataset:
import weave
import pandas as pd
import os
import random
import asyncio
import time
import nest_asyncio
from litellm import acompletion
from openai import RateLimitError
from weave import Evaluation
from weave.scorers import RobustnessScorer, create_perturbed_dataset
from datasets import load_dataset

# Initialize Weave client
weave_client = weave.init("robustness-scorer-eval")

# Set environment variables
os.environ["OPENAI_API_KEY"] = "your api key"

# Define constants
EXPONENTIAL_BASE = 2

# LiteLLMSystem Class
class LiteLLMSystem(weave.Model):
    model_name: str
    temp: float = 0.0
    max_tokens: int = 2048
    top_p: float = 0.95
    max_retries: int = 3

    def __init__(self, **data):
        super().__init__(**data)
        # o1-style models do not accept these sampling parameters
        if "o1" in self.model_name:
            self.temp = None
            self.top_p = None
            self.max_tokens = None

    @weave.op()
    async def predict(self, prompt: str):
        delay = 2
        for i in range(self.max_retries):
            try:
                response = await acompletion(
                    model=self.model_name,
                    messages=[{"role": "user", "content": prompt}],
                    temperature=self.temp,
                    max_tokens=self.max_tokens,
                    top_p=self.top_p,
                )
                if response.choices[0].message.content is not None:
                    return response.choices[0].message.content
                else:
                    raise Exception("No content in response")
            except RateLimitError as e:
                # Exponential backoff with jitter on rate limits
                delay *= EXPONENTIAL_BASE * (1 + random.random())
                time.sleep(delay)
                continue
            except Exception as e:
                continue
        raise Exception("Failed to get response after max retries")


# Perturber Classes
class PerturberLLMSystemGPT4oMini(weave.Model):
    system: LiteLLMSystem
    prompt: str = """You are a helpful assistant who can answer different kind of questions in a very concise manner.
Question: {question}
Answer:
"""

    @weave.op()
    async def predict(self, questions: list[str]) -> list[str]:
        answers = []
        for question in questions:
            answers.append(await self.system.predict(
                self.prompt.format(question=question)
            ))
        return answers


class PerturberLLMSystemGPT4o(weave.Model):
    system: LiteLLMSystem
    prompt: str = """You are a helpful assistant who can answer different kind of questions in a very concise manner.
Question: {question}
Answer:
"""

    @weave.op()
    async def predict(self, questions: list[str]) -> list[str]:
        answers = []
        for question in questions:
            answers.append(await self.system.predict(
                self.prompt.format(question=question)
            ))
        return answers


# Load and preprocess dataset
ds = load_dataset("CohereForAI/aya_evaluation_suite", "aya_human_annotated")["test"]
ds = ds.filter(lambda x: x["language"] == "eng").select_columns(["inputs", "targets"])
ds = ds.rename_column("targets", "ground_truths")
ds_sampled = ds.shuffle(seed=42).select(range(50))

# Create perturbed dataset
original_inputs = ds_sampled["inputs"]
perturbed_inputs = create_perturbed_dataset(original_inputs, num_perturbations=4)
ground_truths = ds_sampled["ground_truths"]

def create_free_form_question_answer(perturbed_inputs, ground_truths):
    # Pair each set of perturbed questions with its (repeated) ground truth answer
    free_form_question_answer = []
    for idx, question_set in enumerate(perturbed_inputs):
        questions = question_set['questions']
        ground_truth = ground_truths[idx]
        answers = [ground_truth] * len(questions)
        free_form_question_answer.append({
            'questions': questions,
            'ground_truths': answers
        })
    return free_form_question_answer

free_form_question_answer = create_free_form_question_answer(perturbed_inputs, ground_truths)

# Run evaluation with RobustnessScorer
robustness_scorer = RobustnessScorer(
    use_exact_match=False,  # Use embedding-based semantic similarity
    use_ground_truths=True,
)

evaluation = Evaluation(
    dataset=free_form_question_answer,
    scorers=[robustness_scorer]
)

# Evaluate systems
system4o = LiteLLMSystem(model_name="gpt-4o-2024-08-06")
perturber4o = PerturberLLMSystemGPT4o(system=system4o)
asyncio.run(evaluation.evaluate(perturber4o))

system4oMini = LiteLLMSystem(model_name="gpt-4o-mini")
perturber4oMini = PerturberLLMSystemGPT4oMini(system=system4oMini)
asyncio.run(evaluation.evaluate(perturber4oMini))

In this setup, the Weave framework manages the evaluations. Datasets are prepared by introducing controlled perturbations to the original inputs, simulating real-world variability. These perturbed examples are paired with either ground truth outputs or the original model outputs as references; since ground truth labels are available for this evaluation, we include them in the dataset. Additionally, we set use_exact_match to False, ensuring that outputs are judged on meaning rather than string matching.
Evaluations leverage Weave's RobustnessScorer, which is designed to calculate the similarity between outputs for original and perturbed inputs. The scoring process uses the Cohen's d metric to quantify the effect size of perturbations. This metric provides a rigorous statistical basis for assessing robustness across varied scenarios.
Here are the results for our evaluation:

Additionally, we can analyze the specific responses for each model on each example, using the comparisons view inside Weave:

This feature provides a detailed side-by-side comparison of the outputs generated by each model for every individual example in the dataset, along with their corresponding reference text. It allows us to see the qualitative differences in how each model approaches the task, such as variations in phrasing and inclusion of key details. By drilling down into these comparisons, we can identify patterns in strengths and weaknesses for each model, gaining deeper insights into why one model might perform better overall or fail in specific scenarios. This level of granularity is helpful for debugging and improving model performance.

Conclusion

Robustness is a key factor in ensuring that large language models maintain consistent performance when faced with noisy or perturbed inputs. Without robustness, AI models can become unreliable, producing inconsistent outputs in real-world applications where user inputs vary due to typos, rephrasings, or formatting differences.
This evaluation used the AYA Evaluation Suite, enhanced with controlled perturbations, to measure robustness using semantic similarity metrics like Cohen’s d, which are particularly suited for continuous outputs. By avoiding metrics like PDR, which lacks statistical stability, and Cohen’s h, which is more appropriate for discrete outputs, Weave ensures that robustness is measured in a way that aligns with the nature of AI-generated responses.
Weave simplifies the robustness evaluation process by providing a structured framework to:
  • Organize and compare robustness scores across different models
  • Use statistical effect size metrics (Cohen’s h for binary outputs, Cohen’s d for free-form text)
  • Ensure interpretability and reproducibility through side-by-side comparisons
These guardrails help teams identify weaknesses, improve model stability, and ensure AI systems maintain trustworthy and predictable behavior across diverse real-world inputs.
By integrating robustness assessments into AI workflows, organizations can proactively improve model resilience, reduce unexpected failures, and build AI systems that meet the highest standards of consistency, accuracy, and reliability.

Iterate on AI agents and models faster. Try Weights & Biases today.