
Model Evaluation Analysis

Visual analysis of recent model evaluation results showing performance metrics and trends.
Created on March 27 | Last edited on March 27

Overview

This report visualizes and analyzes the three most recent evaluations in the wandb-applied-ai-team/mcp-tests project, highlighting key performance trends and metrics.
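For reference, the runs summarized below can be pulled programmatically with the public W&B API. This is a minimal sketch: the summary key names (`correctness`, `avg_score`, `latency`) are placeholders for whatever the evaluations actually logged.

```python
import wandb

# Fetch the most recent runs from the project via the public W&B API.
api = wandb.Api()
runs = api.runs(
    "wandb-applied-ai-team/mcp-tests",
    order="-created_at",  # newest first
)

# Print summary metrics for the three most recent evaluations.
for run in list(runs)[:3]:
    print(run.name, run.created_at)
    # These keys are assumptions; substitute the metric names the evaluations logged.
    for key in ("correctness", "avg_score", "latency"):
        value = run.summary.get(key)
        if value is not None:
            print(f"  {key}: {value}")
```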

Key Findings

Correctness

- Latest evaluation (March 12, 2025, 14:19:43): **76.5%** correct answers
- Previous evaluation (March 12, 2025, 13:40:32): **68.4%** correct answers
- Correctness improved by **8.1 percentage points** between these two evaluations

Scoring

- Latest average score: **2.65 / 3.0**
- Previous average score: **2.54 / 3.0**
- Improvement of **0.11 points** in average score
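For clarity, the deltas quoted above work out as follows (the 8.1-point figure is a difference in percentage points; relative to the previous run it is roughly an 11.8% improvement):

```python
# Deltas between the two most recent evaluations, using the figures reported above.
latest_correct, previous_correct = 0.765, 0.684
latest_score, previous_score = 2.65, 2.54

# 76.5% - 68.4% = 8.1 percentage points (about an 11.8% relative improvement).
correctness_delta_pp = (latest_correct - previous_correct) * 100
relative_improvement = (latest_correct - previous_correct) / previous_correct * 100

# 2.65 - 2.54 = 0.11 points on the 3-point scale.
score_delta = latest_score - previous_score

print(f"Correctness: +{correctness_delta_pp:.1f} pp ({relative_improvement:.1f}% relative)")
print(f"Average score: +{score_delta:.2f} / 3.0")
```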

Latency

- Latest evaluation: **94.1 seconds**
- Previous evaluation: **37.9 seconds**
- February evaluation: **0.015 seconds** (different evaluation type)

The latency increase between the two most recent evaluations is significant and may be an area to investigate.
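One way to start that investigation is to pull per-step latency from the latest run's history via the W&B API and check whether the 94.1-second average is driven by a few outliers or a uniform slowdown. This is a hedged sketch: the run ID and the `latency` key name are assumptions that depend on how the evaluation logged its timings.

```python
import wandb

api = wandb.Api()

# "RUN_ID" is a placeholder; replace it with the ID of the latest evaluation run.
run = api.run("wandb-applied-ai-team/mcp-tests/RUN_ID")

# "latency" is an assumed metric key -- use whatever key the evaluation logged.
history = run.history(keys=["latency"], pandas=True)

# Surface the slowest steps first.
print(history.sort_values("latency", ascending=False).head(10))
```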

API Success Rates

- Latest evaluation: **100%** success across all APIs
- Previous evaluation: **98.98%** for the chat API (1 error), 100% for all other APIs

February Evaluation (Bias-focused)

The February evaluation used a different methodology, focusing on bias metrics:

- Gender bias: **0%** (none detected)
- Racial bias: **33.3%** (detected in 3 instances)
- Context relevance: **33.3%** of tests passed
- Overall bias test: **66.7%** of samples passed

Recommendations

1. Investigate the significant latency increase in the latest evaluation
2. Continue monitoring correctness improvements
3. Consider revisiting bias evaluations to track progress in that area

Next Steps

- Set up regular evaluation runs to establish better trending data (see the sketch below)
- Investigate latency issues while maintaining accuracy improvements
- Consider combining correctness and bias evaluations in future runs
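One simple way to get regular evaluation runs is a scheduled loop that logs each evaluation to the same project with consistent tags, so runs can be compared over time. This is only a sketch: `run_evaluation()` is a hypothetical stand-in for the real evaluation harness, and the interval, tags, and metric names are assumptions.

```python
import time
import wandb

def run_evaluation() -> dict:
    """Placeholder for the actual evaluation harness; returns summary metrics."""
    # Hypothetical metric names -- substitute whatever the real harness produces.
    return {"correctness": 0.0, "avg_score": 0.0, "latency_s": 0.0}

EVAL_INTERVAL_S = 24 * 60 * 60  # once a day; adjust as needed

while True:
    run = wandb.init(
        project="mcp-tests",
        entity="wandb-applied-ai-team",
        job_type="evaluation",
        tags=["scheduled", "correctness+bias"],  # consistent tags make trending easier
    )
    wandb.log(run_evaluation())
    run.finish()
    time.sleep(EVAL_INTERVAL_S)
```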
