Wandbot Evaluation Analysis Report
Analysis of recent Wandbot evaluations across model configurations, comparing GPT-4o with GPT-4-0125
Created on March 21 | Last edited on March 21
Executive Summary
This report analyzes recent Wandbot evaluations in the `mcp-tests` project, focusing on performance metrics across different model configurations. The analysis reveals that the latest evaluation using GPT-4o (November 2024) demonstrates significant improvements in both correctness and efficiency compared to previous configurations using GPT-4-0125.
Key Findings
- **Correctness**: The GPT-4o model achieved 76.5% correctness, an 8.1 percentage point improvement over the 68.4% of previous evaluations
- **Latency**: Average response time decreased from 372.1s in February to 94.1s with GPT-4o
- **API Reliability**: The GPT-4o evaluation achieved 100% success rate across all API calls
- **Error Rates**: Chat API error rates fell from 2.0% (February) to 1.0%, and then to 0% in the most recent evaluation
Evaluation Overview
Four evaluation traces were analyzed from the `wandb-applied-ai-team/mcp-tests` project:
| Evaluation Name | Date | Model | Samples | Trials | Run Time |
|-----------------|------|-------|---------|--------|----------|
| wandbot_gpt-4o-2024-11-20 | Mar 12, 2025 | GPT-4o | 98 | 1 | ~17 min |
| wandbot_less_query_enhancement | Mar 12, 2025 | GPT-4-0125 | 98 | 1 | ~4 min |
| dummy-evaluation | Feb 27, 2025 | Custom | 9 | 1 | ~2 sec |
| wandbot-eval | Feb 27, 2025 | GPT-4-0125 | 98 | 2 | ~18 min |
Performance Metrics
Correctness and Latency
The most recent evaluation with GPT-4o shows significant improvements:
| Metric | GPT-4o (Mar 12) | Less Query Enhancement (Mar 12) | Original Eval (Feb 27) | Improvement |
|--------|-----------------|----------------------------------|------------------------|-------------|
| Correctness | 76.5% | 68.4% | 68.4% | +8.1 pp |
| Avg. Latency | 94.1s | 37.9s | 372.1s | -74.7% from original |
While the "Less Query Enhancement" configuration achieved the lowest latency at 37.9 seconds, it came at the cost of correctness compared to the GPT-4o configuration.
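The headline figures above can be reproduced with a few lines of arithmetic (a sketch using only the numbers from the table; the raw per-sample data lives in the underlying W&B traces). Note that the correctness gain is a difference in percentage points, not a relative change:

```python
# Headline metrics from the table above.
gpt4o_correct, baseline_correct = 76.5, 68.4   # percent correct
gpt4o_latency, baseline_latency = 94.1, 372.1  # average seconds per response

# Correctness gain: difference in percentage points.
correctness_gain_pp = gpt4o_correct - baseline_correct

# Latency improvement: relative reduction versus the February baseline.
latency_reduction_pct = (1 - gpt4o_latency / baseline_latency) * 100

print(f"+{correctness_gain_pp:.1f} pp correctness, "
      f"-{latency_reduction_pct:.1f}% latency")
# → +8.1 pp correctness, -74.7% latency
```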
API Success Rates
API reliability has improved across evaluations:
| API Type | GPT-4o (Mar 12) | Less QE (Mar 12) | Original (Feb 27) |
|----------|-----------------|------------------|-------------------|
| Chat success | 100.0% | 99.0% | 98.0% |
| Reranker success | 100.0% | 100.0% | 100.0% |
| Embedding success | 100.0% | 100.0% | 100.0% |
| Query enhancer success | 100.0% | 100.0% | 100.0% |
| Chat Errors | 0.0% | 1.0% | 2.0% |
The GPT-4o evaluation achieved perfect API reliability with zero errors across all API calls.
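The chat success percentages follow directly from failure counts over the 98 evaluation samples; a minimal sketch (the helper name is illustrative, and the per-call outcomes would come from the evaluation traces):

```python
def success_rate(total_calls: int, failures: int) -> float:
    """Return the success rate as a percentage, rounded to one decimal."""
    return round(100 * (total_calls - failures) / total_calls, 1)

# 98 evaluation samples, one chat call each.
print(success_rate(98, 0))  # GPT-4o (Mar 12)            → 100.0
print(success_rate(98, 1))  # Less Query Enhancement     → 99.0
print(success_rate(98, 2))  # Original eval (Feb 27)     → 98.0
```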
Bias Metrics (Dummy Evaluation)
The "dummy-evaluation" trace provided bias metrics:
| Metric | Score | Occurrence |
|--------|-------|------------|
| Gender Bias | 0.2% | 0.0% |
| Racial Bias | 28.6% | 33.3% |
| Context Relevance | 43.0% | 33.3% |
Configuration Details
Wandbot Chat Configuration
All evaluations used similar configuration settings with different model versions:
| Configuration | GPT-4o | Less QE | Original |
|---------------|--------|---------|----------|
| Top K | 15 | 15 | 15 |
| Reranker provider | Cohere | Cohere | Cohere |
| Reranker model | rerank-english-v2.0 | rerank-english-v2.0 | rerank-english-v2.0 |
| Query enhancer model | gpt-4o-2024-11-20 | gpt-4-0125-preview | gpt-4-0125-preview |
| Response synthesizer model | gpt-4o-2024-11-20 | gpt-4-0125-preview | gpt-4-0125-preview |
Conclusion and Recommendations
Based on the evaluation results, the GPT-4o configuration delivers the best overall performance:
1. **Higher correctness**: 76.5% correct answers, an 8.1 percentage point improvement over the 68.4% baseline
2. **Improved reliability**: 100% success rate across all API calls
3. **Reasonable latency**: 94.1s average response time, a 74.7% improvement from February
Recommendations
1. **Adopt GPT-4o**: The significant improvements in correctness and reliability justify adopting the GPT-4o model for production use
2. **Further latency optimization**: Investigate if elements from the "Less Query Enhancement" configuration could be incorporated to further reduce latency while maintaining correctness
3. **Additional evaluation metrics**: Consider adding more comprehensive bias testing similar to the dummy evaluation for future evaluations
Appendix: System Information
All evaluations were run on Darwin Kernel 24.3.0 with Python 3.12.8, except for the dummy-evaluation which used Linux with Python 3.11.11.