
Top 3 Evaluations Analysis Report

Comprehensive analysis of the top-performing evaluations in the mcp-tests project, comparing correctness scores, costs, and performance metrics
Created on May 22 | Last edited on May 22


*wandb-applied-ai-team/mcp-tests*

Executive Summary

This report analyzes the top 3 performing evaluations from the mcp-tests project, focusing on correctness scores, cost efficiency, and operational metrics. Our analysis reveals significant insights about model performance trade-offs and optimization opportunities.

Evaluation Overview

1. **wandbot_gpt-4o-2024-11-20** ⭐ *Best Overall Performance*

- **Correctness Score:** 76.5% (75/98 correct answers)
- **Total Samples:** 98
- **Success Rate:** 100%
- **Average Latency:** 94.1ms
- **Total Cost:** $5.39
- **Date:** March 12, 2025

2. **wandbot_less_query_enhancement** 🚀 *Best Latency*

- **Correctness Score:** 68.4% (67/98 correct answers)
- **Total Samples:** 98
- **Success Rate:** 99.0%
- **Average Latency:** 37.9ms (60% faster)
- **Total Cost:** $5.30
- **Date:** March 12, 2025

3. **wandbot-eval** 📊 *Largest Scale*

- **Correctness Score:** 68.4% (134/196 correct answers)
- **Total Samples:** 196 (2x larger dataset)
- **Success Rate:** 98.0%
- **Average Latency:** 372.1ms
- **Total Cost:** $10.64
- **Date:** February 27, 2025

Key Performance Metrics

Correctness Analysis

The GPT-4o configuration achieved the highest correctness score at **76.5%**, representing an **8.1 percentage point improvement** over the other configurations. This translates to approximately **8 additional correct answers** out of every 100 queries.

Cost Efficiency

- **Most Cost-Effective:** Less Query Enhancement at $5.30
- **Best Value:** GPT-4o at $5.39 (only $0.09 more for an 8.1-point correctness gain)
- **Highest Cost:** Standard evaluation at $10.64 (2x cost for the same per-sample performance)

Latency Performance

- **Fastest:** Less Query Enhancement at 37.9ms average
- **Production-Ready:** GPT-4o at 94.1ms average
- **Slowest:** Standard evaluation at 372.1ms average

Cost vs Performance Analysis

The analysis reveals a clear **efficiency frontier**, with GPT-4o providing the optimal balance:

- **8.1 percentage points higher correctness** than the alternatives
- **Only 1.7% higher cost** than the cheapest option
- **Acceptable latency** (94.1ms average) for most production use cases
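The frontier claim can be sanity-checked by computing cost per correct answer from the figures reported above (a quick back-of-the-envelope calculation, not part of the original evaluation pipeline):

```python
# Figures taken directly from the evaluation overview in this report.
evals = {
    "wandbot_gpt-4o-2024-11-20": {"correct": 75, "samples": 98, "cost": 5.39},
    "wandbot_less_query_enhancement": {"correct": 67, "samples": 98, "cost": 5.30},
    "wandbot-eval": {"correct": 134, "samples": 196, "cost": 10.64},
}

for name, e in evals.items():
    accuracy = e["correct"] / e["samples"]        # fraction of correct answers
    cost_per_correct = e["cost"] / e["correct"]   # dollars per correct answer
    print(f"{name}: {accuracy:.1%} accuracy, ${cost_per_correct:.3f} per correct answer")
```

Despite its slightly higher total cost, GPT-4o is the cheapest per *correct* answer (about $0.072 vs. $0.079 for the other two), which is what makes it the frontier point.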

API Reliability Insights

All evaluations demonstrated robust operational characteristics:

- **Query Enhancer LLM API:** 99-100% success rate across all evaluations
- **Chat API:** 98-100% success rate with minimal error rates
- **Embedding/Reranker APIs:** 100% success rate consistently
- **Error Handling:** Comprehensive error tracking and recovery

Token Usage Analysis

GPT-4o Configuration

- **Prompt Tokens:** 517,225
- **Completion Tokens:** 7,255
- **Total Requests:** 98
- **Average Tokens per Request:** 5,352

Less Query Enhancement

- **Prompt Tokens:** 508,141 (1.8% reduction)
- **Completion Tokens:** 7,335 (1.1% increase)
- **Total Requests:** 98
- **Average Tokens per Request:** 5,260
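The per-request averages follow directly from the reported totals (prompt plus completion tokens, divided by request count); a quick reproduction:

```python
# Token totals as reported for each configuration above.
token_usage = {
    "gpt-4o": {"prompt": 517_225, "completion": 7_255, "requests": 98},
    "less_qe": {"prompt": 508_141, "completion": 7_335, "requests": 98},
}

for name, t in token_usage.items():
    total = t["prompt"] + t["completion"]
    avg = total / t["requests"]
    print(f"{name}: {total:,} total tokens, {avg:,.0f} tokens per request")
```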

Recommendations

For Production Deployment

1. **Primary Choice:** GPT-4o configuration
- Best accuracy-cost trade-off
- Reliable sub-100ms latency
- Proven 100% success rate

For Cost-Sensitive Applications

2. **Alternative:** Less Query Enhancement
- 60% latency reduction
- Lowest total cost ($5.30)
- Acceptable 8.1-point correctness trade-off

For Large-Scale Evaluations

3. **Considerations:** Batch processing approach
- Standard evaluation shows linear cost scaling
- Consider parallel processing for time efficiency
- Monitor per-sample cost metrics
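The parallel-processing suggestion could be sketched as follows; `run_sample` and `run_batch` are hypothetical stand-ins, not part of the actual evaluation harness:

```python
from concurrent.futures import ThreadPoolExecutor

def run_sample(query: str) -> str:
    """Placeholder for one evaluation call (e.g. an HTTP request to the model)."""
    return query.upper()

def run_batch(queries, max_workers=8):
    # Evaluation calls are I/O-bound, so thread-level parallelism reduces
    # wall-clock time roughly in proportion to max_workers without
    # changing per-sample cost.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_sample, queries))

print(run_batch(["query one", "query two"]))
```

Because `pool.map` preserves input order, results line up with the original sample list, which keeps per-sample cost tracking straightforward.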

Technical Implementation Notes

Model Configuration Differences

- **GPT-4o:** Enhanced query processing with full feature set
- **Less QE:** Reduced query enhancement for speed optimization
- **Standard:** Full-featured baseline with larger evaluation dataset

Success Rate Factors

- Robust error handling across all configurations
- Consistent API reliability (>98% success)
- Effective fallback mechanisms for failed requests

Future Optimization Opportunities

1. **Hybrid Approach:** Use Less QE for time-sensitive queries, GPT-4o for accuracy-critical tasks
2. **Batch Optimization:** Implement request batching to reduce per-query overhead
3. **Caching Strategy:** Cache common query patterns to reduce API costs
4. **A/B Testing:** Continuous evaluation of new model configurations
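The hybrid-routing and caching ideas could be combined in a small dispatcher. The `query_model` stub, routing rule, and cache size below are illustrative assumptions, not the project's actual implementation:

```python
from functools import lru_cache

def query_model(model: str, query: str) -> str:
    """Placeholder for a real model call; returns a tagged string here."""
    return f"{model}: answer to {query!r}"

@lru_cache(maxsize=1024)
def answer(query: str, accuracy_critical: bool = False) -> str:
    # Hybrid approach: route accuracy-critical traffic to GPT-4o and
    # latency-sensitive traffic to the Less Query Enhancement variant.
    model = "gpt-4o-2024-11-20" if accuracy_critical else "less_query_enhancement"
    # Caching strategy: lru_cache serves repeated (query, flag) pairs
    # without issuing a new API call.
    return query_model(model, query)

answer("How do I log a run?")
answer("How do I log a run?")  # second call is a cache hit
print(answer.cache_info())
```

In production the cache key would need to account for any context beyond the query string, and an `lru_cache` would typically be replaced by a shared store such as Redis.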

Conclusion

The GPT-4o configuration emerges as the clear winner, providing the best overall value with **76.5% correctness** at a competitive cost of **$5.39**. For applications requiring ultra-low latency, the Less Query Enhancement variant offers a viable alternative with an acceptable correctness trade-off.

These insights enable data-driven decisions for model deployment strategies and highlight the importance of comprehensive evaluation frameworks in LLM application development.

---

*Report generated on May 22, 2025 | Data source: W&B Weave Evaluations*
