Hallucination Scorer

Metric Definition

“Hallucination” covers a broad range of LLM generation errors. Common definitions include:
  • Faithfulness, a measure focused on factual consistency. It is closely related to Context Adherence but subtly different in measurement due to its focus on the factuality of individual claims in the generation.
  • Context Adherence, a measure of content fidelity.
  • Entailment, measuring whether the generation is logically consistent with the provided context document(s).
Our definition of hallucination covers both Faithfulness and Entailment. When measured using entailment, our definition requires that the hypothesis (the LLM generation) is classed as “contradictory”, a conservative approach that provides a strong signal of hallucination. The “Neutral” class is ambiguous in this case, as the hypothesis does not directly conflict with the premise.
Faithfulness is a useful measure of how acceptable the generation is in a closed-domain setting, checking that it doesn’t introduce external information or facts and maintains focus on the provided context. Faithfulness is also useful for deciding whether the “Neutral” case from entailment should be further considered a hallucination.
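As a rough sketch of the decision rule above (an illustration, not the scorer's actual implementation), an entailment-based check might map NLI labels to a hallucination flag as follows, treating only “contradiction” as a hallucination and deferring “neutral” to a faithfulness check:

# Minimal sketch of the entailment-based decision rule described above.
# The label names follow common NLI conventions; the actual scorer's
# labels and thresholds may differ.
def entailment_to_hallucination(nli_label: str) -> dict:
    """Map an NLI label for (context = premise, generation = hypothesis)
    to a conservative hallucination decision."""
    label = nli_label.lower()
    if label == "contradiction":
        # Direct conflict with the context: strong hallucination signal.
        return {"is_hallucination": True, "needs_faithfulness_check": False}
    if label == "neutral":
        # Ambiguous: not contradicted, but possibly unsupported claims.
        # Defer to a faithfulness check on the individual claims.
        return {"is_hallucination": False, "needs_faithfulness_check": True}
    # "entailment": the generation is supported by the context.
    return {"is_hallucination": False, "needs_faithfulness_check": False}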

v1 Model Selection

For v1 of the Weave Scorer for hallucination we have selected the open source HHEM-2.1 model.

Evaluation

Our current best performing model is based on SmolLM2-135M-Instruct*; its W&B training run can be found here.
With an F1 score of 0.62 and 76% recall, we see this as a baseline to build on. Both GPT-4o-mini and HHEM-2.1 have an F1 of 0.66. HHEM-2.1 is currently the best open source model we have evaluated, and we are confident we can match and then exceed its performance in future training runs.
*We chose this architecture because it will easily enable us to scale our training pipelines up to larger, more performant models such as the 360M variant or Qwen1.5-0.5B in future.
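For reference, the F1 and recall figures above are computed from binary hallucination predictions; a minimal sketch with scikit-learn (the arrays here are illustrative, not our evaluation data):

from sklearn.metrics import f1_score, recall_score

# y_true: ground-truth hallucination labels (1 = hallucinated, 0 = not)
# y_pred: the scorer's binary predictions on the same samples
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(f"F1:     {f1_score(y_true, y_pred):.2f}")
print(f"Recall: {recall_score(y_true, y_pred):.2f}")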



Usage

This is how to call the Hallucination Scorer:
from weave.scorers import HallucinationScorer

hallucination_scorer = HallucinationScorer()

result = hallucination_scorer.score(
    query="What is the capital of Antarctica?",
    context="People in Antarctica love the penguins.",
    output="While Antarctica is renowned for its sea life, penguins aren't the favorite.",
)
print(f"Output is hallucinated: {result['is_hallucination']}")

Datasets

The RAGTruth and FinQA datasets are used for training and evaluation.
RAGTruth
This is a commonly used RAG dataset for training and benchmarking hallucination detection. It has almost 18k samples, split into 15,090 training samples (44% with hallucinations) and 2,700 test samples (34% with hallucinations). Our processed dataset is here.
FinQA
The FinQA dataset includes questions, context and answers about financial data; it is challenging because the context also includes data tables. Based on the ground truth answer provided, we used o1-mini to generate synthetic hallucinated answers, so each sample in the original dataset now has a hallucinated counterpart. Our processed dataset is available here.
In both cases we limited the test sets to 1,000 samples each and used the remainder as part of the training set.
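A minimal sketch of how such synthetic hallucinated answers could be generated (the prompt wording and client call here are illustrative assumptions, not our exact generation pipeline):

from openai import OpenAI

client = OpenAI()

def hallucinate_answer(question: str, context: str, answer: str) -> str:
    """Ask the model for a plausible but incorrect answer.
    The prompt below is a hypothetical reconstruction."""
    prompt = (
        "Given the question, context and correct answer below, write a "
        "plausible-sounding answer that subtly contradicts the context.\n\n"
        f"Question: {question}\nContext: {context}\nCorrect answer: {answer}"
    )
    response = client.chat.completions.create(
        model="o1-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content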

RAGTruth details

Training

The metrics, metadata and checkpoints for our best performing run can be found here.

Methodology

Dataset

SmolLM2-135M

SmolLM2-135M-Instruct - 2024-11-29

SmolLM2-135M-Instruct - 2024-11-30

SmolLM2-135M-Instruct - 2024-12-01

SmolLM2-135M-Instruct - 2024-12-01, lr 0.001 - Best run

Dec 6th update - 360M-Instruct, Qwen-0.5B-Instruct, new test ds, modified prompt, deepspeed/Warmup

Dec 7th - SmolLM2 competitive

Dec 8th - Updated deepspeed AdamW epsilon to 1e-6, adding weight decay, pad_to="left"

  • SmolLM2-360M base is still not doing great vs 135M; exploring changes:
  • Updated the deepspeed AdamW epsilon to 1e-6 from 1e-8, starting from this run, 12:00pm IST 8/12/2024.
  • Experimenting with a higher weight decay (0.1) alongside the epsilon change to test it.
  • Ran one small trial of using pad_to="left" at inference (comparing this eval vs this eval) but it didn't change the F1 score; see the sketch after this list.
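For context, left padding for batched inference is typically set on the tokenizer in Hugging Face transformers; a minimal sketch (assuming a standard transformers tokenizer, not our exact inference code):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M-Instruct")
# Ensure a pad token exists before enabling padding.
tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token
# Pad on the left so generation starts immediately after the prompt
# tokens for every sequence in the batch.
tokenizer.padding_side = "left"
batch = tokenizer(
    ["Premise: ... Hypothesis: ...", "A second, longer example input ..."],
    padding=True,
    return_tensors="pt",
)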

Dec 9th Adam epsilon 1e-8 vs 1e-6

While switching to deepspeed we wanted to ensure the Adam epsilon was set correctly. We experimented with 1e-6 instead of the default 1e-8; however, the default was better in this case.
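The epsilon lives in the optimizer section of the DeepSpeed config; a minimal sketch of the relevant fragment (the surrounding values are illustrative, not our full config):

# Fragment of a DeepSpeed config as a Python dict; only the optimizer
# section is relevant to the epsilon experiment above.
ds_config = {
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": 1e-3,
            "eps": 1e-8,  # default; 1e-6 performed worse in our runs
            "weight_decay": 0.01,
        },
    },
}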



Checkpoints

models = {
    "HuggingFaceTB/SmolLM2-135M-Instruct-sft-hallu-lr0.001-ne7-wr0.05": {
        "checkpoints": [3535, 3987, 4430],
        "artifact": "c-metrics/hallucination/SmolLM2-135M-Instruct-sft-hallu:v56",
        "run_url": "https://wandb.ai/c-metrics/hallucination/runs/7bpeof03",
    },
    "HuggingFaceTB/SmolLM2-135M-Instruct-sft-hallu-lr0.001-ne11-wr0.1": {
        "checkpoints": [5555, 5055, 4549],
        "artifact": "c-metrics/hallucination/SmolLM2-135M-Instruct-sft-hallu:v62",
        "run_url": "https://wandb.ai/c-metrics/hallucination/runs/ni8fx8ff",
    },
    "HuggingFaceTB/SmolLM2-135M-Instruct-sft-hallu-lr0.001-ne9-wr0.05": {
        "checkpoints": [4044, 4430, 4545],
        "artifact": "c-metrics/hallucination/SmolLM2-135M-Instruct-sft-hallu:v59",
        "run_url": "https://wandb.ai/c-metrics/hallucination/runs/yr17p4oq",
    },
    "HuggingFaceTB/SmolLM2-135M-sft-hallu-lr0.0003-ne15-wr0.007": {
        "checkpoints": [3650, 5215, 5736, 6258, 6779, 7301, 7815],
        "artifact": "c-metrics/hallucination/SmolLM2-135M-sft-hallu:v32",
        "run_url": "https://wandb.ai/c-metrics/hallucination/runs/b11nnuci",
    },
    "HuggingFaceTB/SmolLM2-360M-sft-hallu-lr0.0003-ne5-wr0.05": {
        "checkpoints": [521, 1042, 1563, 2085, 2605],  # 2605 was best
        "artifact": "c-metrics/hallucination/SmolLM2-360M-sft-hallu:v2",
        "run_url": "https://wandb.ai/c-metrics/hallucination/runs/x7fva6ug",
    },
    "HuggingFaceTB/SmolLM2-360M-sft-hallu-lr5e-05-ne5-wr0.05": {
        "checkpoints": [10, 521, 1042, 1563, 2084, 2085, 2605],  # 2605 was also best
        "artifact": "c-metrics/hallucination/SmolLM2-360M-sft-hallu:v3",
        "run_url": "https://wandb.ai/c-metrics/hallucination/runs/if0xkp4d",
    },
    "HuggingFaceTB/SmolLM2-360M-sft-hallu-lr5e-05-ne15-wr0.05": {
        "checkpoints": [2085, 2605, 3126, 3647, 4169, 4690, 5211, 5732, 6253, 6774, 7295, 7815],  # 7295 was also best
        "artifact": "c-metrics/hallucination/SmolLM2-360M-sft-hallu:v4",
        "run_url": "https://wandb.ai/c-metrics/hallucination/runs/z891045k",
    },
    "HuggingFaceTB/SmolLM2-360M-sft-hallu-lr5e-05-ne15-wr0.05-wd0.01-eps1e-6": {
        "checkpoints": [1042, 1563, 2084, 2085, 2605, 3126, 3647, 4169, 4690, 5211, 5732, 6253, 6774, 7295, 7815],  # 7815 was best
        "artifact": "c-metrics/hallucination/SmolLM2-360M-sft-hallu:v5",
        "run_url": "https://wandb.ai/c-metrics/hallucination/runs/i7haathy",
    },
}
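As a usage note, the artifacts above can be pulled programmatically with the W&B public API; a short sketch (assuming you have access to the c-metrics/hallucination project):

import wandb

api = wandb.Api()
for model_name, info in models.items():
    # Download the checkpoint artifact for each run listed above.
    artifact = api.artifact(info["artifact"])
    local_dir = artifact.download()
    print(f"{model_name}: downloaded to {local_dir} ({info['run_url']})")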

Further Improvements