# Last 5 Wandbot Evaluations Performance Analysis
Analysis of the most recent 5 evaluations in the wandbot-eval project, showing correctness scores and accuracy percentages
## Executive Summary
This report analyzes the performance of the 5 most recent evaluations in the wandbot-eval project, covering evaluations from April 17, 2025 to June 10, 2025.
## Key Findings
### Performance Metrics
- **Average Correctness Score**: 2.37/3.0 (79%)
- **Average Accuracy**: 57.4% (unweighted mean of the five per-evaluation accuracies)
- **Best Performing Evaluation**: wandbot_v1.3.3_test-v55-index (91.0% accuracy)
- **Evaluation Duration Range**: 44 seconds to 1 hour 41 minutes
### Evaluation Details (ranked by correctness score)
#### 1. wandbot_v1.3.3_test-v55-index (June 10, 2025)
- **Correctness Score**: 2.88/3.0
- **Accuracy**: 91.0% (446/490 correct answers)
- **Duration**: 1h 14m
- **Status**: Latest and best performing evaluation
#### 2. v1.3.2 PROD - wandbot_v1-3-2_test_updated_v54_index (May 19, 2025)
- **Correctness Score**: 2.87/3.0
- **Accuracy**: 90.4% (443/490 correct answers)
- **Duration**: 1h 41m
- **Status**: Production version with strong performance
#### 3. wandbot_v1-3-2_o4-mini-2025-04-16-medium-response_flash_query_enhancer (April 17, 2025)
- **Correctness Score**: 2.78/3.0
- **Accuracy**: 85.3% (418/490 correct answers)
- **Duration**: 1h 8m
- **Note**: Used the o4-mini (2025-04-16, medium) model variant with a flash query enhancer, per the run name
#### 4. intercom_eval_answers-1_trial (May 20, 2025)
- **Correctness Score**: 1.67/3.0
- **Accuracy**: 11.2% (11/98 correct answers)
- **Duration**: 44 seconds
- **Note**: Smaller test dataset (98 vs 490 samples)
#### 5. intercom_eval_answers-5_trial (May 20, 2025)
- **Correctness Score**: 1.63/3.0
- **Accuracy**: 9.0% (44/490 correct answers)
- **Duration**: 5 minutes
- **Note**: Appears to be testing with Intercom-specific data
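For reference, the headline averages in Key Findings can be recomputed from the per-evaluation figures above. A minimal sketch (numbers transcribed from this report, run names abbreviated; the sample-weighted pooled accuracy is shown alongside for comparison):

```python
# Recompute the headline averages from the figures listed above.
evals = [
    # (name, correctness, correct, total)
    ("wandbot_v1.3.3_test-v55-index",               2.88, 446, 490),
    ("v1.3.2 PROD - wandbot_v1-3-2_test_updated",   2.87, 443, 490),
    ("wandbot_v1-3-2_o4-mini_flash_query_enhancer", 2.78, 418, 490),
    ("intercom_eval_answers-1_trial",               1.67,  11,  98),
    ("intercom_eval_answers-5_trial",               1.63,  44, 490),
]

avg_correctness = sum(c for _, c, _, _ in evals) / len(evals)                  # ~2.37 / 3.0 (~79%)
avg_accuracy = sum(k / n for _, _, k, n in evals) / len(evals)                 # ~57.4% (unweighted mean)
pooled_accuracy = sum(k for _, _, k, _ in evals) / sum(n for _, _, _, n in evals)  # ~66.2% (sample-weighted)

print(f"average correctness: {avg_correctness:.2f}/3.0 ({avg_correctness / 3:.0%})")
print(f"average accuracy:    {avg_accuracy:.1%}")
print(f"pooled accuracy:     {pooled_accuracy:.1%}")
```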
## Performance Trends
### Strengths
- **Consistent High Performance**: The main wandbot versions (v1.3.2 and v1.3.3) show excellent performance with >85% accuracy
- **Improved Latest Version**: The most recent v1.3.3 evaluation achieved the highest accuracy at 91.0%
- **Reliability**: Main evaluations complete successfully with no errors
### Areas for Investigation
- **Intercom Evaluations**: Both Intercom-specific evaluations show significantly lower performance (<12% accuracy), suggesting potential domain-specific challenges
- **Model Variants**: The o4-mini variant scores roughly 5 percentage points lower (85.3%) than the standard versions (90%+)
## Cost Analysis
- **Token Usage**: Full 490-sample evaluations consume roughly 2.3M - 3.0M total tokens
- **Cost per Evaluation**: Approximately $6-8 USD for full 490-sample evaluations
- **Judge Model**: All evaluations are scored with GPT-4o-2024-11-20
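For a rough per-sample view of these figures (assuming the full 490-sample runs), the arithmetic works out as follows:

```python
# Back-of-the-envelope per-sample figures for the full 490-sample evaluations.
samples = 490
tokens_low, tokens_high = 2_300_000, 3_000_000
cost_low_usd, cost_high_usd = 6.0, 8.0  # approximate, from this report

print(f"tokens per sample: {tokens_low / samples:,.0f} - {tokens_high / samples:,.0f}")      # ~4,694 - ~6,122
print(f"cost per sample:   ${cost_low_usd / samples:.3f} - ${cost_high_usd / samples:.3f}")  # ~$0.012 - ~$0.016
```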
## Recommendations
1. **Continue v1.3.3 Development**: Latest version shows best performance and should be prioritized
2. **Investigate Intercom Data**: Low performance on Intercom evaluations needs analysis - may require domain-specific tuning
3. **Model Selection**: The standard wandbot configuration outperforms the o4-mini variant (90%+ vs 85.3%) for this use case
4. **Monitoring**: Establish regular evaluation cadence to track performance trends
## Technical Details
- **Evaluation Framework**: Weave-based evaluation system
- **Scorer**: WandbotCorrectnessScorer
- **Standard Dataset**: 490 samples for main evaluations
- **Test Dataset**: 98 samples for trial evaluations
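As an illustration of how a Weave-based evaluation with a function-style scorer is typically wired up, here is a minimal sketch. The dataset row, `ToyBot` model, and `correctness_scorer` are illustrative placeholders, not wandbot's actual pipeline or the real WandbotCorrectnessScorer, and the scorer parameter names follow recent Weave releases:

```python
import asyncio
import weave

weave.init("wandbot-eval")  # project name taken from this report

# Placeholder rows; the real runs use 490 (main) or 98 (trial) samples.
dataset = [
    {"question": "How do I log a metric?", "expected": "Use wandb.log({...})."},
]

@weave.op()
def correctness_scorer(expected: str, output: str) -> dict:
    # Stand-in for WandbotCorrectnessScorer, which is an LLM judge returning
    # a 0-3 correctness rating; this toy check only looks for string overlap.
    return {"correct": expected.lower() in output.lower()}

class ToyBot(weave.Model):
    @weave.op()
    def predict(self, question: str) -> str:
        # A real wandbot model would retrieve docs and call an LLM here.
        return "Use wandb.log({...})."

evaluation = weave.Evaluation(dataset=dataset, scorers=[correctness_scorer])
asyncio.run(evaluation.evaluate(ToyBot()))
```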
## Interactive Charts
The performance dashboard above shows two key metrics:
1. **Correctness Score Chart**: Shows the WandbotCorrectnessScorer rating (0-3 scale) for each evaluation
2. **Accuracy Percentage Chart**: Shows the percentage of correctly answered questions
Both charts are interactive - hover over bars to see detailed information about each evaluation including full name, metrics, date, and duration.