
Last 5 Wandbot Evaluations Performance Analysis

Analysis of the most recent 5 evaluations in the wandbot-eval project, showing correctness scores and accuracy percentages


Executive Summary

This report analyzes the performance of the 5 most recent evaluations in the wandbot-eval project, covering evaluations from April 17, 2025 to June 10, 2025.

Key Findings

Performance Metrics

- **Average Correctness Score**: 2.37/3.0 (79%)
- **Average Accuracy**: 59.4%
- **Best Performing Evaluation**: wandbot_v1.3.3_test-v55-index (91.0% accuracy)
- **Evaluation Duration Range**: 44 seconds to 1 hour 41 minutes

Evaluation Details

#### 1. wandbot_v1.3.3_test-v55-index (June 10, 2025)
- **Correctness Score**: 2.88/3.0
- **Accuracy**: 91.0% (446/490 correct answers)
- **Duration**: 1h 14m
- **Status**: Latest and best performing evaluation

#### 2. v1.3.2 PROD - wandbot_v1-3-2_test_updated_v54_index (May 19, 2025)
- **Correctness Score**: 2.87/3.0
- **Accuracy**: 90.4% (443/490 correct answers)
- **Duration**: 1h 41m
- **Status**: Production version with strong performance

#### 3. wandbot_v1-3-2_o4-mini-2025-04-16-medium-response_flash_query_enhancer (April 17, 2025)
- **Correctness Score**: 2.78/3.0
- **Accuracy**: 85.3% (418/490 correct answers)
- **Duration**: 1h 8m
- **Note**: Used the o4-mini model variant

#### 4. intercom_eval_answers-1_trial (May 20, 2025)
- **Correctness Score**: 1.67/3.0
- **Accuracy**: 11.2% (11/98 correct answers)
- **Duration**: 44 seconds
- **Note**: Smaller test dataset (98 vs. 490 samples)

#### 5. intercom_eval_answers-5_trial (May 20, 2025)
- **Correctness Score**: 1.63/3.0
- **Accuracy**: 9.0% (44/490 correct answers)
- **Duration**: 5 minutes
- **Note**: Appears to be testing with Intercom-specific data
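For reference, the headline numbers in the Key Findings section can be reproduced from the per-evaluation figures above. The sketch below uses the values listed in this report; the field names are illustrative (not the actual Weave schema), and the averages are simple unweighted means across the five runs.

```python
# Recompute the headline metrics from the per-evaluation results listed above.
# Field names are illustrative, not the actual Weave schema.
evals = [
    {"name": "wandbot_v1.3.3_test-v55-index",       "correctness": 2.88, "correct": 446, "total": 490},
    {"name": "v1.3.2 PROD (v54 index)",              "correctness": 2.87, "correct": 443, "total": 490},
    {"name": "v1.3.2 o4-mini flash_query_enhancer",  "correctness": 2.78, "correct": 418, "total": 490},
    {"name": "intercom_eval_answers-1_trial",        "correctness": 1.67, "correct": 11,  "total": 98},
    {"name": "intercom_eval_answers-5_trial",        "correctness": 1.63, "correct": 44,  "total": 490},
]

for e in evals:
    e["accuracy"] = e["correct"] / e["total"]  # e.g. 446/490 ≈ 0.910

avg_correctness = sum(e["correctness"] for e in evals) / len(evals)  # ≈ 2.37
avg_accuracy = sum(e["accuracy"] for e in evals) / len(evals)        # simple unweighted mean
best = max(evals, key=lambda e: e["accuracy"])

print(f"Average correctness: {avg_correctness:.2f}/3.0")
print(f"Average accuracy:    {avg_accuracy:.1%}")
print(f"Best evaluation:     {best['name']} ({best['accuracy']:.1%})")
```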

Strengths

- **Consistent High Performance**: The main wandbot versions (v1.3.2 and v1.3.3) show excellent performance with >85% accuracy
- **Improved Latest Version**: The most recent v1.3.3 evaluation achieved the highest accuracy at 91.0%
- **Reliability**: Main evaluations complete successfully with no errors

Areas for Investigation

- **Intercom Evaluations**: Both Intercom-specific evaluations show significantly lower performance (<12% accuracy), suggesting potential domain-specific challenges
- **Model Variants**: The o4-mini version shows slightly lower performance (85.3%) compared to the standard versions (90%+)

Cost Analysis

- **Token Usage**: Evaluations consume 2.3M-3.0M total tokens
- **Cost per Evaluation**: Approximately $6-8 USD for full 490-sample evaluations (a rough per-run estimate is sketched below)
- **Model**: All evaluations use GPT-4o-2024-11-20
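As a rough illustration of how these costs arise, the sketch below estimates a per-run cost from token counts. The per-token prices are assumptions (verify against current OpenAI pricing), not values recorded by the evaluations.

```python
# Back-of-the-envelope cost estimate for a full 490-sample evaluation.
# Prices are ASSUMED GPT-4o rates, not values logged by the evaluations; check current pricing.
INPUT_PRICE_PER_M = 2.50    # USD per 1M input tokens (assumption)
OUTPUT_PRICE_PER_M = 10.00  # USD per 1M output tokens (assumption)

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of one evaluation run from its token counts."""
    return (input_tokens / 1e6) * INPUT_PRICE_PER_M + (output_tokens / 1e6) * OUTPUT_PRICE_PER_M

# Example split of ~2.5M total tokens: 2.4M input, 0.1M output -> $7.00,
# inside the $6-8 range observed above.
print(f"${estimate_cost(2_400_000, 100_000):.2f}")
```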

Recommendations

1. **Continue v1.3.3 Development**: The latest version shows the best performance and should be prioritized
2. **Investigate Intercom Data**: Low performance on Intercom evaluations needs analysis and may require domain-specific tuning
3. **Model Selection**: Standard GPT-4o performs better than the o4-mini variant for this use case
4. **Monitoring**: Establish a regular evaluation cadence to track performance trends

Technical Details

- **Evaluation Framework**: Weave-based evaluation system (a minimal sketch of the setup follows this list)
- **Scorer**: WandbotCorrectnessScorer
- **Standard Dataset**: 490 samples for main evaluations
- **Test Dataset**: 98 samples for trial evaluations
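The sketch below shows what such a setup looks like with the Weave evaluation API. The model function, scorer logic, and dataset rows are placeholders standing in for the real wandbot pipeline and WandbotCorrectnessScorer, whose internals are not shown in this report; the 0-3 scale and the "correct" threshold in the placeholder scorer are assumptions.

```python
import asyncio
import weave

weave.init("wandbot-eval")  # project name taken from this report

# Placeholder model: the real system would call wandbot and return its answer.
@weave.op()
def wandbot_model(question: str) -> str:
    return "Use wandb.log({'accuracy': 0.9}) to log a metric."  # canned answer for illustration

# Placeholder scorer: the real WandbotCorrectnessScorer grades answers on a 0-3 scale;
# the >= 2 "correct" threshold here is an assumption, not documented in this report.
@weave.op()
def correctness_scorer(question: str, ground_truth: str, output: str) -> dict:
    score = 3 if "wandb.log" in output else 0
    return {"correctness": score, "correct": score >= 2}

# Toy dataset: the standard set has 490 rows, the trial set 98.
dataset = [
    {"question": "How do I log a metric with wandb?",
     "ground_truth": "Call wandb.log with a dict of metric names and values."},
]

evaluation = weave.Evaluation(dataset=dataset, scorers=[correctness_scorer])
asyncio.run(evaluation.evaluate(wandbot_model))
```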

Interactive Charts

The performance dashboard above shows two key metrics:

1. **Correctness Score Chart**: Shows the WandbotCorrectnessScorer rating (0-3 scale) for each evaluation
2. **Accuracy Percentage Chart**: Shows the percentage of correctly answered questions

Both charts are interactive: hover over a bar to see detailed information about each evaluation, including its full name, metrics, date, and duration.
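For readers viewing a static export, both charts can be reproduced from the numbers in this report with a short script; the shortened run labels below are for display only.

```python
import matplotlib.pyplot as plt

# Static reproduction of the two dashboard charts, using the numbers from this report.
# Run labels are shortened for readability.
names = ["v1.3.3 v55", "v1.3.2 PROD v54", "v1.3.2 o4-mini", "intercom 1-trial", "intercom 5-trial"]
correctness = [2.88, 2.87, 2.78, 1.67, 1.63]  # WandbotCorrectnessScorer, 0-3 scale
accuracy = [91.0, 90.4, 85.3, 11.2, 9.0]      # percent of questions answered correctly

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
ax1.bar(names, correctness)
ax1.set_ylim(0, 3)
ax1.set_ylabel("Correctness score (0-3)")
ax1.set_title("Correctness Score")
ax2.bar(names, accuracy)
ax2.set_ylim(0, 100)
ax2.set_ylabel("Accuracy (%)")
ax2.set_title("Accuracy Percentage")
for ax in (ax1, ax2):
    ax.tick_params(axis="x", labelrotation=30)
fig.tight_layout()
plt.show()
```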