
Model Evaluation Analysis

Visual analysis of recent model evaluation results showing performance metrics and trends.
Created on March 27 | Last edited on March 27

Overview

This report visualizes and analyzes the three most recent evaluations in the wandb-applied-ai-team/mcp-tests project, highlighting key performance trends and metrics.
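For reference, the runs summarized below can be pulled programmatically with the public W&B API. This is a minimal sketch: the summary key names (`correctness`, `avg_score`, `latency`) are placeholders for whatever the evaluations actually logged.

```python
import wandb

# Fetch the most recent runs from the project via the public W&B API.
api = wandb.Api()
runs = api.runs(
    "wandb-applied-ai-team/mcp-tests",
    order="-created_at",  # newest first
)

# Print summary metrics for the three most recent evaluations.
for run in list(runs)[:3]:
    print(run.name, run.created_at)
    # These keys are assumptions; substitute the metric names the evaluations logged.
    for key in ("correctness", "avg_score", "latency"):
        value = run.summary.get(key)
        if value is not None:
            print(f"  {key}: {value}")
```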

Key Findings

Correctness

- Latest evaluation (March 12, 2025, 14:19:43): **76.5%** correct answers
- Previous evaluation (March 12, 2025, 13:40:32): **68.4%** correct answers
- Correctness improved by **8.1 percentage points** between these two evaluations

Scoring

- Latest average score: **2.65 / 3.0**
- Previous average score: **2.54 / 3.0**
- Improvement of **0.11 points** in average score
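For clarity, the deltas quoted above work out as follows (the 8.1-point figure is a difference in percentage points; relative to the previous run it is roughly an 11.8% improvement):

```python
# Deltas between the two most recent evaluations, using the figures reported above.
latest_correct, previous_correct = 0.765, 0.684
latest_score, previous_score = 2.65, 2.54

# 76.5% - 68.4% = 8.1 percentage points (about an 11.8% relative improvement).
correctness_delta_pp = (latest_correct - previous_correct) * 100
relative_improvement = (latest_correct - previous_correct) / previous_correct * 100

# 2.65 - 2.54 = 0.11 points on the 3-point scale.
score_delta = latest_score - previous_score

print(f"Correctness: +{correctness_delta_pp:.1f} pp ({relative_improvement:.1f}% relative)")
print(f"Average score: +{score_delta:.2f} / 3.0")
```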

Latency

- Latest evaluation: **94.1 seconds**
- Previous evaluation: **37.9 seconds**
- February evaluation: **0.015 seconds** (different evaluation type)

The latency increase between the two most recent evaluations is significant and may be an area to investigate.
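One way to start that investigation is to pull per-step latency from the latest run's history via the W&B API and check whether the 94.1-second average is driven by a few outliers or a uniform slowdown. This is a hedged sketch: the run ID and the `latency` key name are assumptions that depend on how the evaluation logged its timings.

```python
import wandb

api = wandb.Api()

# "RUN_ID" is a placeholder; replace it with the ID of the latest evaluation run.
run = api.run("wandb-applied-ai-team/mcp-tests/RUN_ID")

# "latency" is an assumed metric key -- use whatever key the evaluation logged.
history = run.history(keys=["latency"], pandas=True)

# Surface the slowest steps first.
print(history.sort_values("latency", ascending=False).head(10))
```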

API Success Rates

- Latest evaluation: **100%** success across all APIs
- Previous evaluation: **98.98%** for the chat API (1 error), 100% for all other APIs

February Evaluation (Bias-focused)

The February evaluation used a different methodology, focusing on bias metrics:

- Gender bias: **0%** (none detected)
- Racial bias: **33.3%** (detected in 3 instances)
- Context relevance: **33.3%** of tests passed
- Overall bias test: **66.7%** of samples passed

Recommendations

1. Investigate the significant latency increase in the latest evaluation
2. Continue monitoring correctness improvements
3. Consider revisiting bias evaluations to track progress in that area

Next Steps

- Set up regular evaluation runs to establish better trending data (see the sketch below)
- Investigate latency issues while maintaining accuracy improvements
- Consider combining correctness and bias evaluations in future runs
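One simple way to get regular evaluation runs is a scheduled loop that logs each evaluation to the same project with consistent tags, so runs can be compared over time. This is only a sketch: `run_evaluation()` is a hypothetical stand-in for the real evaluation harness, and the interval, tags, and metric names are assumptions.

```python
import time
import wandb

def run_evaluation() -> dict:
    """Placeholder for the actual evaluation harness; returns summary metrics."""
    # Hypothetical metric names -- substitute whatever the real harness produces.
    return {"correctness": 0.0, "avg_score": 0.0, "latency_s": 0.0}

EVAL_INTERVAL_S = 24 * 60 * 60  # once a day; adjust as needed

while True:
    run = wandb.init(
        project="mcp-tests",
        entity="wandb-applied-ai-team",
        job_type="evaluation",
        tags=["scheduled", "correctness+bias"],  # consistent tags make trending easier
    )
    wandb.log(run_evaluation())
    run.finish()
    time.sleep(EVAL_INTERVAL_S)
```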
