Hallucination Model Performance Comparison
Analysis of model performance metrics across different runs
Created on April 2 | Last edited on April 2
Overview
This report compares key performance metrics across different runs in the c-metrics/hallucination project, highlighting where recent runs outperform or underperform previous runs.
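This kind of cross-run comparison is easy to reproduce offline. The sketch below is illustrative only (the helper is not part of the project's tooling); it re-ranks a few runs by any metric, using run IDs and values copied from the tables in this report.

```python
def rank_runs(runs, metric, descending=True):
    """Return (run_id, value) pairs sorted by the given metric."""
    return sorted(
        ((r["id"], r[metric]) for r in runs),
        key=lambda pair: pair[1],
        reverse=descending,
    )

# A small sample of runs, with values taken from the tables below.
runs = [
    {"id": "b11nnuci", "accuracy": 0.640, "f1": 0.661, "loss": 0.185},
    {"id": "vdikgyb2", "accuracy": 0.630, "f1": 0.594, "loss": 0.600},
    {"id": "guminfu9", "accuracy": 0.544, "f1": 0.543, "loss": 0.499},
]

print(rank_runs(runs, "accuracy")[0][0])          # best accuracy -> b11nnuci
print(rank_runs(runs, "loss", descending=False))  # ascending by loss
```

Note that accuracy and loss need opposite sort directions, which is why the tables below sort accuracy and F1 descending but loss ascending.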
Key Metrics Comparison
Accuracy
| Run ID | Display Name | Accuracy | Created At |
|----------|--------------|----------|------------|
| b11nnuci | SmolLM2-135M-sft-hallu-lr0.0003-ne15-wr0.007 | 64.0% | 2024-12-07 |
| vdikgyb2 | SmolLM2-135M-sft-hallu-lr5e-05-ne5-wr0.2 | 63.0% | 2024-12-07 |
| zniqzrnw | SmolLM2-360M-sft-hallu-lr5e-06-ne5-wr0.05 | 61.4% | 2024-12-09 |
| 191sgm2q | SmolLM2-360M-sft-hallu-lr0.0003-ne5-wr0.35 | 61.2% | 2024-12-09 |
| x7fva6ug | SmolLM2-360M-sft-hallu-lr0.0003-ne5-wr0.05 | 60.1% | 2024-12-07 |
| z891045k | SmolLM2-360M-sft-hallu-lr5e-05-ne15-wr0.05 | 58.7% | 2024-12-07 |
| z7otc99c | SmolLM2-360M-sft-hallu-lr5e-05-ne5-wr0.35 | 58.4% | 2024-12-09 |
| i7haathy | SmolLM2-360M-sft-hallu-lr5e-05-ne15-wr0.05-wd0.01-eps1e-6 | 57.8% | 2024-12-08 |
| if0xkp4d | SmolLM2-360M-sft-hallu-lr5e-05-ne5-wr0.05 | 56.7% | 2024-12-07 |
| 4ljrg7uj | SmolLM2-360M-sft-hallu-lr5e-05-ne5-wr0.35 | 55.5% | 2024-12-09 |
| t5xundrt | SmolLM2-360M-sft-hallu-lr5e-05-ne5-wr0.05 | 55.4% | 2024-12-08 |
| guminfu9 | SmolLM2-360M-sft-hallu-lr5e-05-ne5-wr0.05-bs128 | 54.4% | 2024-12-09 |
| 3a7jhe7y | SmolLM2-360M-sft-hallu-lr5e-05-ne5-wr0.05-wd0.02 | 53.8% | 2024-12-08 |
F1 Score
| Run ID | Display Name | F1 Score | Created At |
|----------|--------------|----------|------------|
| b11nnuci | SmolLM2-135M-sft-hallu-lr0.0003-ne15-wr0.007 | 0.661 | 2024-12-07 |
| vdikgyb2 | SmolLM2-135M-sft-hallu-lr5e-05-ne5-wr0.2 | 0.594 | 2024-12-07 |
| kl8j41ou | SmolLM2-135M-sft-hallu-lr5e-05-ne5-wr0.1 | 0.580 | 2024-12-07 |
| z891045k | SmolLM2-360M-sft-hallu-lr5e-05-ne15-wr0.05 | 0.554 | 2024-12-07 |
| if0xkp4d | SmolLM2-360M-sft-hallu-lr5e-05-ne5-wr0.05 | 0.545 | 2024-12-07 |
| guminfu9 | SmolLM2-360M-sft-hallu-lr5e-05-ne5-wr0.05-bs128 | 0.543 | 2024-12-09 |
| zniqzrnw | SmolLM2-360M-sft-hallu-lr5e-06-ne5-wr0.05 | 0.532 | 2024-12-09 |
| x7fva6ug | SmolLM2-360M-sft-hallu-lr0.0003-ne5-wr0.05 | 0.532 | 2024-12-07 |
| 4ljrg7uj | SmolLM2-360M-sft-hallu-lr5e-05-ne5-wr0.35 | 0.522 | 2024-12-09 |
| 3a7jhe7y | SmolLM2-360M-sft-hallu-lr5e-05-ne5-wr0.05-wd0.02 | 0.502 | 2024-12-08 |
| t5xundrt | SmolLM2-360M-sft-hallu-lr5e-05-ne5-wr0.05 | 0.501 | 2024-12-08 |
| z7otc99c | SmolLM2-360M-sft-hallu-lr5e-05-ne5-wr0.35 | 0.397 | 2024-12-09 |
| 191sgm2q | SmolLM2-360M-sft-hallu-lr0.0003-ne5-wr0.35 | 0.319 | 2024-12-09 |
| i7haathy | SmolLM2-360M-sft-hallu-lr5e-05-ne15-wr0.05-wd0.01-eps1e-6 | 0.294 | 2024-12-08 |
Loss
| Run ID | Display Name | Loss | Created At |
|----------|--------------|------|------------|
| 191sgm2q | SmolLM2-360M-sft-hallu-lr0.0003-ne5-wr0.35 | 0.127 | 2024-12-09 |
| x7fva6ug | SmolLM2-360M-sft-hallu-lr0.0003-ne5-wr0.05 | 0.127 | 2024-12-07 |
| z891045k | SmolLM2-360M-sft-hallu-lr5e-05-ne15-wr0.05 | 0.134 | 2024-12-07 |
| b11nnuci | SmolLM2-135M-sft-hallu-lr0.0003-ne15-wr0.007 | 0.185 | 2024-12-07 |
| i7haathy | SmolLM2-360M-sft-hallu-lr5e-05-ne15-wr0.05-wd0.01-eps1e-6 | 0.205 | 2024-12-08 |
| if0xkp4d | SmolLM2-360M-sft-hallu-lr5e-05-ne5-wr0.05 | 0.311 | 2024-12-07 |
| z7otc99c | SmolLM2-360M-sft-hallu-lr5e-05-ne5-wr0.35 | 0.326 | 2024-12-09 |
| vwrot6b3 | SmolLM2-360M-sft-hallu-lr6e-05-ne15-wr0.05 | 0.346 | 2024-12-08 |
| guminfu9 | SmolLM2-360M-sft-hallu-lr5e-05-ne5-wr0.05-bs128 | 0.499 | 2024-12-09 |
| kl8j41ou | SmolLM2-135M-sft-hallu-lr5e-05-ne5-wr0.1 | 0.594 | 2024-12-07 |
| vdikgyb2 | SmolLM2-135M-sft-hallu-lr5e-05-ne5-wr0.2 | 0.600 | 2024-12-07 |
| 3a7jhe7y | SmolLM2-360M-sft-hallu-lr5e-05-ne5-wr0.05-wd0.02 | 0.659 | 2024-12-08 |
| t5xundrt | SmolLM2-360M-sft-hallu-lr5e-05-ne5-wr0.05 | 0.660 | 2024-12-08 |
| 4ljrg7uj | SmolLM2-360M-sft-hallu-lr5e-05-ne5-wr0.35 | 0.673 | 2024-12-09 |
| zniqzrnw | SmolLM2-360M-sft-hallu-lr5e-06-ne5-wr0.05 | 0.926 | 2024-12-09 |
Precision & Recall
| Run ID | Display Name | Precision | Recall | Created At |
|----------|--------------|-----------|--------|------------|
| 191sgm2q | SmolLM2-360M-sft-hallu-lr0.0003-ne5-wr0.35 | 0.632 | 0.214 | 2024-12-09 |
| b11nnuci | SmolLM2-135M-sft-hallu-lr0.0003-ne15-wr0.007 | 0.552 | 0.824 | 2024-12-07 |
| vdikgyb2 | SmolLM2-135M-sft-hallu-lr5e-05-ne5-wr0.2 | 0.558 | 0.636 | 2024-12-07 |
| zniqzrnw | SmolLM2-360M-sft-hallu-lr5e-06-ne5-wr0.05 | 0.550 | 0.514 | 2024-12-09 |
| x7fva6ug | SmolLM2-360M-sft-hallu-lr0.0003-ne5-wr0.05 | 0.532 | 0.533 | 2024-12-07 |
| z7otc99c | SmolLM2-360M-sft-hallu-lr5e-05-ne5-wr0.35 | 0.519 | 0.322 | 2024-12-09 |
| z891045k | SmolLM2-360M-sft-hallu-lr5e-05-ne15-wr0.05 | 0.513 | 0.603 | 2024-12-07 |
| kl8j41ou | SmolLM2-135M-sft-hallu-lr5e-05-ne5-wr0.1 | 0.513 | 0.669 | 2024-12-07 |
| i7haathy | SmolLM2-360M-sft-hallu-lr5e-05-ne15-wr0.05-wd0.01-eps1e-6 | 0.512 | 0.207 | 2024-12-08 |
| if0xkp4d | SmolLM2-360M-sft-hallu-lr5e-05-ne5-wr0.05 | 0.493 | 0.608 | 2024-12-07 |
| 4ljrg7uj | SmolLM2-360M-sft-hallu-lr5e-05-ne5-wr0.35 | 0.481 | 0.570 | 2024-12-09 |
| t5xundrt | SmolLM2-360M-sft-hallu-lr5e-05-ne5-wr0.05 | 0.479 | 0.526 | 2024-12-08 |
| guminfu9 | SmolLM2-360M-sft-hallu-lr5e-05-ne5-wr0.05-bs128 | 0.474 | 0.636 | 2024-12-09 |
| 3a7jhe7y | SmolLM2-360M-sft-hallu-lr5e-05-ne5-wr0.05-wd0.02 | 0.464 | 0.547 | 2024-12-08 |
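The F1 scores reported earlier are the harmonic mean of these precision and recall columns, so the two tables can be cross-checked directly. A quick sanity check (the helper below is a standard formula, not project code), using the b11nnuci and guminfu9 rows:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Values from the precision/recall table above.
print(round(f1_score(0.552, 0.824), 3))  # b11nnuci -> 0.661, matching the F1 table
print(round(f1_score(0.474, 0.636), 3))  # guminfu9 -> 0.543, matching the F1 table
```

Both recomputed values agree with the F1 table to three decimal places.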
Analysis of Recent Runs
Your most recent completed run (guminfu9), from December 9, 2024, shows the following performance characteristics:
Areas where guminfu9 outperforms previous runs:
- **Better recall (0.636)** than many runs, indicating good sensitivity in detecting hallucinations
- **Lower loss (0.499)** than several runs with similar configurations
- **Competitive F1 score (0.543)** that ranks in the upper half of all runs
Areas where guminfu9 underperforms:
- **Lower accuracy (54.4%)** compared to the best runs (64.0% for b11nnuci)
- **Lower precision (0.474)** than most other runs, indicating more false positives
The data suggest that your recent run with batch size 128 trades precision for recall, which may be beneficial depending on your application: if false negatives (missed hallucinations) are costlier than false positives, this is a reasonable direction.
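That trade-off can be made concrete by attaching costs to each error type. The sketch below uses hypothetical costs and an assumed base rate of hallucinated examples (neither appears in this report) to compare the high-precision run 191sgm2q against the high-recall run guminfu9:

```python
def expected_cost(precision, recall, positive_rate, fn_cost, fp_cost):
    """Expected per-example cost implied by a classifier's precision/recall.

    positive_rate is the assumed fraction of truly hallucinated examples.
    """
    fn_rate = positive_rate * (1 - recall)  # missed hallucinations
    tp_rate = positive_rate * recall
    # From precision = TP / (TP + FP): FP = TP * (1 - precision) / precision
    fp_rate = tp_rate * (1 - precision) / precision
    return fn_cost * fn_rate + fp_cost * fp_rate

# Hypothetical: a missed hallucination is 5x worse than a false alarm,
# and half of the evaluated examples are truly hallucinated (assumed).
for run, p, r in [("191sgm2q", 0.632, 0.214), ("guminfu9", 0.474, 0.636)]:
    print(run, round(expected_cost(p, r, positive_rate=0.5,
                                   fn_cost=5, fp_cost=1), 3))
```

Under these assumed costs, the recall-heavy guminfu9 comes out cheaper than the precision-heavy 191sgm2q; with costs closer to symmetric, the ranking can flip.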
Best Performing Models
The overall best performing models appear to be:
1. **b11nnuci (SmolLM2-135M-sft-hallu-lr0.0003-ne15-wr0.007)** - Highest accuracy (64.0%), highest F1 score (0.661), and excellent recall (0.824)
2. **vdikgyb2 (SmolLM2-135M-sft-hallu-lr5e-05-ne5-wr0.2)** - Strong accuracy (63.0%) and good F1 score (0.594)
3. **zniqzrnw (SmolLM2-360M-sft-hallu-lr5e-06-ne5-wr0.05)** - Good accuracy (61.4%), but the highest loss of any run (0.926)
Interestingly, some of the smaller 135M-parameter models outperform the larger 360M-parameter models on key metrics.