Hallucination Metrics Performance Comparison

Overview

This report compares the current run against previous runs in the c-metrics/hallucination project, focusing on five evaluation metrics: accuracy, F1 score, precision, recall, and loss.
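To reproduce this comparison, run summaries can be pulled with the W&B public API. The sketch below is a minimal example using the project path c-metrics/hallucination from this report; the summary key names (accuracy, f1, precision, recall, loss) are assumptions and may need to be adjusted to match what the runs actually log.

```python
# Minimal sketch: fetch run summaries from the c-metrics/hallucination project.
# The summary key names below are assumptions; adjust to match the project's logging.
import wandb

api = wandb.Api()
runs = api.runs("c-metrics/hallucination")

for run in runs:
    s = run.summary
    print(
        run.name,
        s.get("accuracy"),   # may be None if the run logs under different keys
        s.get("f1"),
        s.get("precision"),
        s.get("recall"),
        s.get("loss"),
    )
```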

Key Metrics Comparison

Accuracy

- **Best performer**: HuggingFaceTB/SmolLM2-135M-sft-hallu-lr0.0003-ne15-wr0.007 (64.0%)
- **Runner-up**: HuggingFaceTB/SmolLM2-135M-sft-hallu-lr5e-05-ne5-wr0.2 (63.0%)
- **Recent run**: HuggingFaceTB/SmolLM2-360M-sft-hallu-lr5e-05-ne5-wr0.05-bs128 (54.4%)

The smaller 135M model with a higher learning rate (0.0003) and lower weight regularization (0.007) outperforms the larger models on accuracy.
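For reference, with hallucination detection treated as a binary classification task, accuracy is the fraction of examples classified correctly:

$$
\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
$$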

F1 Score

- **Best performer**: HuggingFaceTB/SmolLM2-135M-sft-hallu-lr0.0003-ne15-wr0.007 (0.661)
- **Runner-up**: HuggingFaceTB/SmolLM2-135M-sft-hallu-lr5e-05-ne5-wr0.05 (0.604)
- **Recent run**: HuggingFaceTB/SmolLM2-360M-sft-hallu-lr5e-05-ne5-wr0.05-bs128 (0.543)

The 135M models consistently outperform the 360M models on F1 score, with the combination of a higher learning rate and lower weight regularization showing particular strength.
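F1 is the harmonic mean of precision and recall:

$$
F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
$$

As a sanity check, the best run's F1 of 0.661 and recall of 0.824 (reported below) imply a precision of roughly 0.55, consistent with that run not appearing at the top of the precision ranking.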

Precision

- **Best performer**: HuggingFaceTB/SmolLM2-360M-sft-hallu-lr0.0003-ne5-wr0.35 (0.632)
- **Runner-up**: HuggingFaceTB/SmolLM2-135M-sft-hallu-lr5e-05-ne5-wr0.2 (0.558)
- **Recent run**: HuggingFaceTB/SmolLM2-360M-sft-hallu-lr5e-05-ne5-wr0.05-bs128 (0.474)

The 360M model with a higher learning rate (0.0003) and high weight regularization (0.35) achieves the best precision.
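Precision measures what fraction of flagged examples are true hallucinations (assuming the hallucination label is the positive class):

$$
\text{precision} = \frac{TP}{TP + FP}
$$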

Recall

- **Best performer**: HuggingFaceTB/SmolLM2-135M-sft-hallu-lr0.0003-ne15-wr0.007 (0.824)
- **Runner-up**: HuggingFaceTB/SmolLM2-135M-sft-hallu-lr5e-05-ne5-wr0.05 (0.702)
- **Recent run**: HuggingFaceTB/SmolLM2-360M-sft-hallu-lr5e-05-ne5-wr0.05-bs128 (0.636)

The 135M models significantly outperform the 360M models on recall, with the highest-learning-rate, lowest-weight-regularization configuration showing particularly strong performance.
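Recall measures what fraction of true hallucinations the model catches:

$$
\text{recall} = \frac{TP}{TP + FN}
$$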

Loss

- **Best performer (lowest loss)**: HuggingFaceTB/SmolLM2-360M-sft-hallu-lr0.0003-ne5-wr0.35 (0.127)
- **Runner-up**: HuggingFaceTB/SmolLM2-360M-sft-hallu-lr0.0003-ne5-wr0.05 (0.127)
- **Recent run**: HuggingFaceTB/SmolLM2-360M-sft-hallu-lr5e-05-ne5-wr0.05-bs128 (0.499)

The 360M models with the higher learning rate (0.0003) achieve the lowest loss values.
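The report does not state which loss is logged; for SFT runs like these it is most commonly the mean token-level cross-entropy, shown here as an assumption rather than a confirmed detail:

$$
\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \log p_\theta(y_i \mid y_{<i}, x)
$$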

Key Insights

1. **Model Size Impact**: Interestingly, the smaller 135M models outperform the larger 360M models on several key metrics, particularly accuracy, F1 score, and recall.
2. **Learning Rate Effect**: Higher learning rates (0.0003) generally lead to better performance across model sizes than lower learning rates (5e-05).
3. **Weight Regularization Influence**: Lower weight regularization values (0.007-0.05) tend to perform better on F1 score and recall, while higher values (0.35) can improve precision at the cost of recall (this trade-off is illustrated in the sketch after this list).
4. **Current vs. Previous Performance**: The most recent run (SmolLM2-360M-sft-hallu-lr5e-05-ne5-wr0.05-bs128) underperforms previous configurations on every key metric. This suggests either that the batch size increase to 128 is hurting performance or that its combination of learning rate and weight regularization is suboptimal.
5. **Best Overall Configuration**: HuggingFaceTB/SmolLM2-135M-sft-hallu-lr0.0003-ne15-wr0.007 shows the best overall performance, with the highest accuracy, F1, and recall, suggesting that this smaller model with well-tuned hyperparameters is more effective for this hallucination detection task.
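As a concrete illustration of how the four headline metrics relate, and why precision can rise while recall falls (insight 3), here is a minimal, self-contained sketch using scikit-learn. The labels and predictions are made up for demonstration and have no connection to the actual runs.

```python
# Illustrative only: made-up predictions showing how the four headline
# metrics are computed, and how an aggressive detector trades precision
# for recall.
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

y_true       = [1, 0, 1, 1, 0, 0, 1, 0]  # 1 = hallucination (assumed positive class)
conservative = [1, 0, 1, 0, 0, 0, 1, 0]  # misses one hallucination, no false alarms
aggressive   = [1, 1, 1, 1, 0, 1, 1, 0]  # catches every hallucination, two false alarms

for name, y_pred in [("conservative", conservative), ("aggressive", aggressive)]:
    print(name,
          "acc=%.3f"  % accuracy_score(y_true, y_pred),
          "prec=%.3f" % precision_score(y_true, y_pred),
          "rec=%.3f"  % recall_score(y_true, y_pred),
          "f1=%.3f"   % f1_score(y_true, y_pred))
```

The aggressive detector reaches perfect recall (1.0) but its precision drops to 0.667, while the conservative one keeps precision at 1.0 at the cost of recall (0.75), mirroring the pattern seen across the weight regularization settings above.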