Wandbot Evaluation Analysis Report

Analysis of recent Wandbot evaluations with different model configurations, including GPT-4o vs GPT-4-0125
Created on March 21 | Last edited on March 21

Executive Summary

This report analyzes recent Wandbot evaluations in the `mcp-tests` project, focusing on performance metrics across different model configurations. The analysis reveals that the latest evaluation using GPT-4o (November 2024) demonstrates significant improvements in both correctness and efficiency compared to previous configurations using GPT-4-0125.

Key Findings

- **Correctness**: The GPT-4o model achieved 76.5% correctness, an 8.1-percentage-point improvement over previous evaluations
- **Latency**: Average response time decreased from 372.1s in February to 94.1s with GPT-4o
- **API Reliability**: The GPT-4o evaluation achieved a 100% success rate across all API calls
- **Error Rates**: Chat API error rates have steadily improved from 2.0% to 0% in the most recent evaluation

Evaluation Overview

Four evaluation traces were analyzed from the `wandb-applied-ai-team/mcp-tests` project:

| Evaluation Name | Date | Model | Samples | Trials | Run Time |
|-----------------|------|-------|---------|--------|----------|
| wandbot_gpt-4o-2024-11-20 | Mar 12, 2025 | GPT-4o | 98 | 1 | ~17 min |
| wandbot_less_query_enhancement | Mar 12, 2025 | GPT-4-0125 | 98 | 1 | ~4 min |
| dummy-evaluation | Feb 27, 2025 | Custom | 9 | 1 | ~2 sec |
| wandbot-eval | Feb 27, 2025 | GPT-4-0125 | 98 | 2 | ~18 min |

Performance Metrics

Correctness and Latency

The most recent evaluation with GPT-4o shows significant improvements:

| Metric | GPT-4o (Mar 12) | Less Query Enhancement (Mar 12) | Original Eval (Feb 27) | Improvement |
|--------|-----------------|----------------------------------|------------------------|-------------|
| Correctness | 76.5% | 68.4% | 68.4% | +8.1 pp |
| Avg. Latency | 94.1s | 37.9s | 372.1s | -74.7% from original |

While the "Less Query Enhancement" configuration achieved the lowest latency at 37.9 seconds, it came at the cost of correctness compared to the GPT-4o configuration.
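The two headline numbers in the table follow directly from the raw metrics; a minimal sketch of the arithmetic (values taken from the report, variable names are illustrative):

```python
# Correctness (fraction correct) and mean latency (seconds) from the tables above.
correctness = {"gpt-4o": 0.765, "less_qe": 0.684, "original": 0.684}
latency_s = {"gpt-4o": 94.1, "less_qe": 37.9, "original": 372.1}

# Absolute correctness gain, in percentage points (76.5% - 68.4%).
pp_gain = (correctness["gpt-4o"] - correctness["original"]) * 100

# Relative latency reduction versus the original February run.
latency_reduction_pct = (1 - latency_s["gpt-4o"] / latency_s["original"]) * 100

print(f"Correctness gain: {pp_gain:.1f} pp")          # 8.1 pp
print(f"Latency reduction: {latency_reduction_pct:.1f}%")  # 74.7%
```

Note that the correctness gain is an absolute difference in percentage points; the relative improvement over the 68.4% baseline would be about 11.8%.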

API Success Rates

API reliability has improved across evaluations:

| API Type | GPT-4o (Mar 12) | Less QE (Mar 12) | Original (Feb 27) |
|----------|-----------------|------------------|-------------------|
| Chat success | 100.0% | 99.0% | 98.0% |
| Reranker success | 100.0% | 100.0% | 100.0% |
| Embedding success | 100.0% | 100.0% | 100.0% |
| Query enhancer success | 100.0% | 100.0% | 100.0% |
| Chat errors | 0.0% | 1.0% | 2.0% |

The GPT-4o evaluation achieved perfect API reliability with zero errors across all API calls.

Bias Metrics (Dummy Evaluation)

The "dummy-evaluation" trace provided bias metrics:

| Metric | Score | Occurrence |
|--------|-------|------------|
| Gender Bias | 0.2% | 0.0% |
| Racial Bias | 28.6% | 33.3% |
| Context Relevance | 43.0% | 33.3% |

Configuration Details

Wandbot Chat Configuration

All evaluations used similar configuration settings with different model versions:

| Configuration | GPT-4o | Less QE | Original |
|---------------|--------|---------|----------|
| Top K | 15 | 15 | 15 |
| Reranker provider | Cohere | Cohere | Cohere |
| Reranker model | rerank-english-v2.0 | rerank-english-v2.0 | rerank-english-v2.0 |
| Query enhancer model | gpt-4o-2024-11-20 | gpt-4-0125-preview | gpt-4-0125-preview |
| Response synthesizer model | gpt-4o-2024-11-20 | gpt-4-0125-preview | gpt-4-0125-preview |
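The shared settings can be captured as a plain configuration mapping. The key names below are assumptions for this sketch, not Wandbot's actual config schema; the values are the ones reported above:

```python
# Illustrative config mapping for the GPT-4o evaluation; key names are
# assumptions for this sketch, not Wandbot's actual configuration schema.
wandbot_gpt4o_config = {
    "top_k": 15,
    "reranker": {
        "provider": "cohere",
        "model": "rerank-english-v2.0",
    },
    "query_enhancer_model": "gpt-4o-2024-11-20",
    "response_synthesizer_model": "gpt-4o-2024-11-20",
}

# The earlier runs differ only in the two model fields:
original_config = {
    **wandbot_gpt4o_config,
    "query_enhancer_model": "gpt-4-0125-preview",
    "response_synthesizer_model": "gpt-4-0125-preview",
}
```

Keeping the retrieval settings (top-k, reranker) fixed across runs is what makes the correctness and latency differences attributable to the model swap alone.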

Conclusion and Recommendations

Based on the evaluation results, the GPT-4o configuration delivers the best overall performance:

1. **Higher correctness**: 76.5% correct answers represents an 8.1-percentage-point improvement
2. **Improved reliability**: 100% success rate across all API calls
3. **Reasonable latency**: 94.1s average response time, a 74.7% improvement from February

Recommendations

1. **Adopt GPT-4o**: The significant improvements in correctness and reliability justify adopting the GPT-4o model for production use
2. **Further latency optimization**: Investigate whether elements of the "Less Query Enhancement" configuration could be incorporated to further reduce latency while maintaining correctness
3. **Additional evaluation metrics**: Consider adding more comprehensive bias testing, similar to the dummy evaluation, to future evaluations

Appendix: System Information

All evaluations were run on Darwin Kernel 24.3.0 with Python 3.12.8, except for the dummy-evaluation which used Linux with Python 3.11.11.
