Model Evaluation Analysis
Visual analysis of recent model evaluation results showing performance metrics and trends.
Overview
This report provides visualization and analysis of the three most recent evaluations in the wandb-applied-ai-team/mcp-tests project, highlighting key performance trends and metrics.
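For reproducibility, the evaluations discussed here can be pulled programmatically with the W&B public API. The sketch below simply lists the most recent runs in the project; the summary keys (`correctness`, `avg_score`, `latency_seconds`) are illustrative placeholders, not necessarily the keys these evaluations actually log.

```python
import wandb

# Minimal sketch: list the three most recent runs in the project and
# print a few summary metrics. The summary key names are assumptions
# for illustration, not the project's actual keys.
api = wandb.Api()
runs = api.runs(
    "wandb-applied-ai-team/mcp-tests",
    order="-created_at",  # newest runs first
)

for run in list(runs)[:3]:
    summary = run.summary
    print(
        run.name,
        run.created_at,
        summary.get("correctness"),
        summary.get("avg_score"),
        summary.get("latency_seconds"),
    )
```

Sorting by creation time keeps the comparison anchored to the same recent runs this report references.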
Key Findings
Correctness
- Latest evaluation (March 12, 2025, 14:19:43): **76.5%** correct answers
- Previous evaluation (March 12, 2025, 13:40:32): **68.4%** correct answers
- An improvement of **8.1 percentage points** in correctness between these two evaluations
Scoring
- Latest average score: **2.65 / 3.0**
- Previous average score: **2.54 / 3.0**
- Improvement of **0.11 points** in average score
Latency
- Latest evaluation: **94.1 seconds**
- Previous evaluation: **37.9 seconds**
- February evaluation: **0.015 seconds** (different evaluation type)
Latency increased roughly 2.5× (from 37.9 s to 94.1 s) between the two most recent evaluations, which may be an area to investigate.
API Success Rates
- Latest evaluation: **100%** success across all APIs
- Previous evaluation: **98.98%** for chat API (1 error), 100% for other APIs
February Evaluation (Bias-focused)
The February evaluation used a different methodology, focusing on bias metrics:
- Gender bias: **0%** (none detected)
- Racial bias: **33.3%** (detected in 3 instances)
- Context relevance: **33.3%** of tests passed
- Overall bias test: **66.7%** of samples passed
Recommendations
1. Investigate the significant latency increase in the latest evaluation
2. Continue monitoring correctness improvements
3. Consider revisiting bias evaluations to track progress in that area
Next Steps
- Set up regular evaluation runs to establish better trending data
- Investigate latency issues while maintaining accuracy improvements
- Consider combining correctness and bias evaluations in future runs (a minimal logging sketch follows this list)
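As a starting point for the last two items, the sketch below logs a single combined correctness-and-bias evaluation run with the standard `wandb.init`/`log` calls. The metric names and the example values (taken from the numbers above) are placeholders, not an established schema for this project.

```python
import wandb

# Minimal sketch of a combined correctness + bias evaluation run.
# Metric names and values are illustrative placeholders; repeating this
# on a schedule builds comparable trend data across runs.
run = wandb.init(
    entity="wandb-applied-ai-team",
    project="mcp-tests",
    job_type="evaluation",
)

run.log({
    "correctness": 0.765,        # fraction of correct answers
    "avg_score": 2.65,           # mean score on the 0-3 scale
    "latency_seconds": 94.1,     # end-to-end latency
    "gender_bias": 0.0,          # fraction of samples flagged
    "racial_bias": 0.333,
    "context_relevance_pass": 0.333,
})

run.finish()
```

Logging both metric families in one run would let later reports chart correctness, latency, and bias trends side by side without reconciling separate evaluation types.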