Introducing Eris v0.1: A novel LLM evaluation framework using debate simulations

Can you rank LLMs through debate? That's precisely what we're trying to find out.
Created on June 30|Last edited on July 16
Comment
Today, I'm excited to introduce Eris v0.1, a proof-of-concept LLM evaluation framework that assesses model performance through simulated debates. 
The Genesis of ErisEris emerged from a fundamental question: How can we evaluate LLMs in a way that simultaneously tests their reasoning, knowledge, and communication skills? Drawing inspiration from academic debates, Eris was designed to pit different LLMs against each other in structured debates on complex topics.
Developed with support from OpenRouter and leveraging Weights & Biases' Weave library, Eris is an open-source tool that simulates full debate flows, including constructive speeches, cross-examinations, rebuttals, and closing arguments. 
Eris: Greek Goddess of Strife & Discord (Stable Cascade)
What we'll be covering: The Genesis of ErisWhat we'll be covering: How Eris WorksInitial Results and InsightsLimitations and Future WorkWhat's Next for Eris?Support ErisAcknowledgments
﻿
﻿
How Eris WorksThe evaluation process in Eris v0.1 follows these steps:
Debate Setup: Two LLMs are assigned to debate a randomly selected topic, taking opposing positions.
Debate Flow: The models engage in a structured debate, mirroring real-world academic debates.
Evaluation: A separate LLM (currently Claude 3.5 Sonnet) acts as a judge, assessing the debate on various criteria:
Argument strength
Logical consistency
Evidence use
Cross-examination effectiveness
Rebuttal quality
Overall persuasiveness
Debate structure adherence
Logical fallacy identification
Rhetorical technique use
Adaptability
Communication clarity
Strategic framing
Emotional intelligence
Analysis: Results across many debates are aggregated to produce win rates and comparative metrics.
Initial Results and InsightsFor this proof-of-concept, 10 leading LLMs were evaluated. Here are the win rates:
ModelRatioWinsLossesanthropic/claude-3.5-sonnet0.7916143openai/gpt-4o0.7716450anthropic/claude-3-opus0.6613670qwen/qwen-2-72b-instruct0.6112278cohere/command-r-plus0.6111977openai/gpt-4-turbo0.4790102google/gemini-pro-1.50.469110601-ai/yi-large0.3263135mistralai/mistral-large0.2347155mistralai/mixtral-8x22b-instruct0.0715192\begin{array}{|l|c|c|c|}
\hline
\text{Model} & \text{Ratio} & \text{Wins} & \text{Losses} \\
\hline
\text{anthropic/claude-3.5-sonnet} & 0.79 & 161 & 43 \\
\text{openai/gpt-4o} & 0.77 & 164 & 50 \\
\text{anthropic/claude-3-opus} & 0.66 & 136 & 70 \\
\text{qwen/qwen-2-72b-instruct} & 0.61 & 122 & 78 \\
\text{cohere/command-r-plus} & 0.61 & 119 & 77 \\
\text{openai/gpt-4-turbo} & 0.47 & 90 & 102 \\
\text{google/gemini-pro-1.5} & 0.46 & 91 & 106 \\
\text{01-ai/yi-large} & 0.32 & 63 & 135 \\
\text{mistralai/mistral-large} & 0.23 & 47 & 155 \\
\text{mistralai/mixtral-8x22b-instruct} & 0.07 & 15 & 192 \\
\hline
\end{array}Modelanthropic/claude-3.5-sonnetopenai/gpt-4oanthropic/claude-3-opusqwen/qwen-2-72b-instructcohere/command-r-plusopenai/gpt-4-turbogoogle/gemini-pro-1.501-ai/yi-largemistralai/mistral-largemistralai/mixtral-8x22b-instruct​Ratio0.790.770.660.610.610.470.460.320.230.07​Wins1611641361221199091634715​Losses4350707877102106135155192​​﻿
And here's a heatmap showing the head-to-head performance of each model:
﻿
﻿
Key findings:
1. Claude 3.5 Sonnet (Anthropic) emerged as the top overall performer, with a 79% win rate.
2. GPT-4o (OpenAI) was a close second at 77%, consistently outperforming Claude 3.5 Sonnet in head-to-head matchups.
3. Claude 3 Opus (Anthropic) secured third place with a 66% win rate.
4. Qwen-72B-Chat (Qwen) demonstrated competitive performance against some stronger models.
5. Mixtral-8x22B-Instruct and Mistral-Large (Mistral AI) underperformed, possibly due to suboptimal prompting strategies.
These results offer intriguing insights into the relative strengths and weaknesses of different LLMs in a debate context. However, it's crucial to emphasize that this is a proof of concept, and the results should not be interpreted as definitive performance rankings.
This has not been peer-reviewed.
Limitations and Future WorkAs with any early-stage project, Eris v0.1 has several limitations that will be addressed in future iterations:
Judging bias: The current use of Claude 3.5 Sonnet for all judging could introduce bias. Future versions may implement an ensemble of models or human cross-examination debate judges for more balanced evaluation.
Prompt engineering: The current system uses a one-size-fits-all prompt. Custom prompts autonomously crafted per model via an iterative process could provide a fairer comparison.
Expanded evaluation criteria: Incorporating more nuanced evaluation metrics could yield deeper insights into model capabilities.
Larger sample size: Increasing the number of debates per model pair would produce more statistically robust results.
Human validation: Introducing human evaluation alongside AI judging could provide valuable cross-validation.
 More models: Expanding the evaluation pool to include a wider range of LLMs will offer a more comprehensive view of the current AI landscape.
What's Next for Eris?Work is currently underway on Eris v1.0, a more comprehensive evaluation framework that will implement many of the proposed improvements mentioned above. This next iteration aims to address the limitations of v0.1 and provide a more robust and insightful evaluation framework.
In parallel, a comprehensive paper detailing the Eris framework, its methodology, and a more in-depth analysis of the results is in development. This paper will also implement some of the proposed improvements and outline future research directions.
In the spirit of open-source, all of the (extremely messy) code and data from the Eris project is available on Github and HuggingFace, respectively. I'd love for you to check them out!
Support ErisIf you're excited about Eris and would like to support this work, I'm currently seeking funding and grants to further develop and expand the project. If you're interested in supporting Eris or collaborating on building the 1.0 version, please don't hesitate to DM me on Twitter.
﻿
AcknowledgmentsSpecial thanks to OpenRouter for a generous grant of $240 and to Weights & Biases for their Weave library, both of which were instrumental in building Eris. Additional thanks to Paige Bailey for assistance in refining and developing Eris into a fully-fledged project.
﻿OpenRouter:
OpenRouter's API played a crucial role in the development of Eris. By providing seamless access to a wide array of LLMs, OpenRouter significantly streamlined the evaluation process. This allowed for efficient testing of multiple models without the need to manage individual API integrations, greatly accelerating the project's development and expanding its scope.
﻿Weights & Biases' Weave:
Weights & Biases' Weave library was instrumental in the development and operation of Eris. Weave's robust debugging and monitoring capabilities ensured smooth execution of the debates, helping to maintain cohesiveness throughout the evaluation process. Its real-time logging and visualization features allowed for continuous oversight of the LLM interactions, enabling quick identification and resolution of any issues while the benchmark was running. This level of monitoring was crucial in maintaining the integrity of the debates and the overall quality of the evaluation framework.
﻿
It's important to emphasize that Eris v0.1 is a rough proof of concept and not meant to be used as a source of ground truth. It's an experimental benchmark designed to spark discussion and inspire new approaches to LLM evaluation.
💡
﻿
Add a comment
Tags: Articles, Community Posts, Tutorial, LLM, GenAI
Iterate on AI agents and models faster. Try Weights & Biases today.