Top 3 Evaluations Analysis Report
Comprehensive analysis of the top-performing evaluations in the mcp-tests project, comparing correctness scores, costs, and performance metrics
Created on May 22|Last edited on May 22
*wandb-applied-ai-team/mcp-tests*
Executive Summary
This report analyzes the top 3 performing evaluations from the mcp-tests project, focusing on correctness scores, cost efficiency, and operational metrics. The analysis quantifies the accuracy, latency, and cost trade-offs between the three configurations and identifies where each is the better deployment choice.
Evaluation Overview
1. **wandbot_gpt-4o-2024-11-20** ⭐ *Best Overall Performance*
- **Correctness Score:** 76.5% (75/98 correct answers)
- **Total Samples:** 98
- **Success Rate:** 100%
- **Average Latency:** 94.1ms
- **Total Cost:** $5.39
- **Date:** March 12, 2025
2. **wandbot_less_query_enhancement** 🚀 *Best Latency*
- **Correctness Score:** 68.4% (67/98 correct answers)
- **Total Samples:** 98
- **Success Rate:** 99.0%
- **Average Latency:** 37.9ms (~60% faster than GPT-4o)
- **Total Cost:** $5.30
- **Date:** March 12, 2025
3. **wandbot-eval** 📊 *Largest Scale*
- **Correctness Score:** 68.4% (134/196 correct answers)
- **Total Samples:** 196 (2x larger dataset)
- **Success Rate:** 98.0%
- **Average Latency:** 372.1ms
- **Total Cost:** $10.64
- **Date:** February 27, 2025
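The headline figures above follow directly from the reported raw counts. A minimal Python sketch (dictionary keys and structure are illustrative, not from the W&B export) reproduces the correctness percentages and the per-sample cost for each run:

```python
# Raw counts and cost totals as reported for each evaluation above.
evals = {
    "wandbot_gpt-4o-2024-11-20": {"correct": 75, "samples": 98, "cost_usd": 5.39},
    "wandbot_less_query_enhancement": {"correct": 67, "samples": 98, "cost_usd": 5.30},
    "wandbot-eval": {"correct": 134, "samples": 196, "cost_usd": 10.64},
}

for name, e in evals.items():
    correctness = 100 * e["correct"] / e["samples"]  # percent correct
    per_sample = e["cost_usd"] / e["samples"]        # USD per query
    print(f"{name}: {correctness:.1f}% correct, ${per_sample:.4f}/sample")
```

Note that while wandbot-eval's total cost is roughly double the others, its per-sample cost (~$0.054/query) is in line with the two smaller runs.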
Key Performance Metrics
Correctness Analysis
The GPT-4o configuration achieved the highest correctness score at **76.5%**, representing an **8.1 percentage point improvement** over the other configurations. This translates to approximately **8 additional correct answers** out of every 100 queries.
Cost Efficiency
- **Most Cost-Effective:** Less Query Enhancement at $5.30
- **Best Value:** GPT-4o at $5.39 (only $0.09 more for 8.1 percentage points higher correctness)
- **Highest Cost:** Standard evaluation at $10.64 (roughly 2x the total cost, but a comparable per-sample cost of ~$0.054/query)
Latency Performance
- **Fastest:** Less Query Enhancement at 37.9ms average
- **Production-Ready:** GPT-4o at 94.1ms average
- **Slowest:** Standard evaluation at 372.1ms average
Cost vs Performance Analysis
The analysis reveals a clear **efficiency frontier** where GPT-4o provides the optimal balance:
- **8.1 percentage points higher correctness** than the alternatives
- **Only 1.7% higher cost** than the cheapest option
- **Acceptable latency** for most production use cases
API Reliability Insights
All evaluations demonstrated robust operational characteristics:
- **Query Enhancer LLM API:** 99-100% success rate across all evaluations
- **Chat API:** 98-100% success rate with minimal error rates
- **Embedding/Reranker APIs:** 100% success rate consistently
- **Error Handling:** Comprehensive error tracking and recovery
Token Usage Analysis
GPT-4o Configuration
- **Prompt Tokens:** 517,225
- **Completion Tokens:** 7,255
- **Total Requests:** 98
- **Average Tokens per Request:** 5,352
Less Query Enhancement
- **Prompt Tokens:** 508,141 (1.8% reduction)
- **Completion Tokens:** 7,335 (1.1% increase)
- **Total Requests:** 98
- **Average Tokens per Request:** 5,260
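The per-request averages follow from the reported token totals. A quick check (evaluation names shortened for readability; the dictionary layout is illustrative):

```python
# Token totals as reported; average tokens/request = (prompt + completion) / requests.
token_usage = {
    "gpt-4o": {"prompt": 517_225, "completion": 7_255, "requests": 98},
    "less_qe": {"prompt": 508_141, "completion": 7_335, "requests": 98},
}

for name, t in token_usage.items():
    avg = (t["prompt"] + t["completion"]) / t["requests"]
    print(f"{name}: {avg:.0f} avg tokens/request")
```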
Recommendations
For Production Deployment
1. **Primary Choice:** GPT-4o configuration
- Best accuracy-cost trade-off
- Reliable sub-100ms latency
- Proven 100% success rate
For Cost-Sensitive Applications
2. **Alternative:** Less Query Enhancement
- 60% latency reduction
- Minimal cost increase
- Acceptable 8.1-percentage-point correctness trade-off
For Large-Scale Evaluations
3. **Considerations:** Batch processing approach
- Standard evaluation shows linear cost scaling
- Consider parallel processing for time efficiency
- Monitor per-sample cost metrics
Technical Implementation Notes
Model Configuration Differences
- **GPT-4o:** Enhanced query processing with full feature set
- **Less QE:** Reduced query enhancement for speed optimization
- **Standard:** Full-featured baseline with larger evaluation dataset
Success Rate Factors
- Robust error handling across all configurations
- Consistent API reliability (>98% success)
- Effective fallback mechanisms for failed requests
Future Optimization Opportunities
1. **Hybrid Approach:** Use Less QE for time-sensitive queries, GPT-4o for accuracy-critical tasks
2. **Batch Optimization:** Implement request batching to reduce per-query overhead
3. **Caching Strategy:** Cache common query patterns to reduce API costs
4. **A/B Testing:** Continuous evaluation of new model configurations
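The hybrid approach in point 1 could be sketched as a simple router. Everything below is hypothetical: the function name, signature, and threshold are illustrative and not part of the evaluated systems; only the average latencies come from the runs above.

```python
# Observed average latencies from the evaluations above (milliseconds).
GPT4O_AVG_MS = 94.1
LESS_QE_AVG_MS = 37.9

def choose_config(latency_budget_ms: float, accuracy_critical: bool) -> str:
    """Route a request to an evaluation configuration (hypothetical router).

    Accuracy-critical work always goes to the GPT-4o configuration;
    otherwise, fall back to the faster Less QE variant only when the
    latency budget is too tight for GPT-4o's observed average.
    """
    if accuracy_critical:
        return "wandbot_gpt-4o-2024-11-20"
    if latency_budget_ms < GPT4O_AVG_MS:
        return "wandbot_less_query_enhancement"
    return "wandbot_gpt-4o-2024-11-20"
```

In practice the threshold would be tuned against tail latencies rather than averages, but the routing logic stays the same.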
Conclusion
The GPT-4o configuration emerges as the clear winner, providing the best overall value proposition with **76.5% correctness** at a competitive cost of **$5.39**. For applications requiring ultra-low latency, the Less Query Enhancement variant offers a viable alternative with acceptable performance trade-offs.
These insights enable data-driven decisions for model deployment strategies and highlight the importance of comprehensive evaluation frameworks in LLM application development.
---
*Report generated on May 22, 2025 | Data source: W&B Weave Evaluations*