FinanceBench: A New Benchmark for Financial Question Answering
A new benchmark for financial LLMs!
FinanceBench is a benchmark dataset designed to evaluate large language models (LLMs) on financial question-answering (QA) tasks. Developed by Patronus AI with collaborators from Contextual AI and Stanford University, it assesses how well LLMs answer the kinds of questions finance professionals routinely ask. The dataset contains more than 10,000 question-answer pairs grounded in the public filings of 40 publicly traded U.S. companies, establishing an evaluation standard for financial QA, an area that had previously been underserved in AI assessment. The questions are structured to measure accuracy, reasoning, and retrieval ability in finance-specific contexts.
FinanceBench functions as an "open book" test: models are given relevant excerpts that simulate the documents finance professionals work with, and they are evaluated on how well they retrieve and use that material. This contrasts with "closed book" conditions, in which models must answer without any external reference material. The diversity of FinanceBench's questions enables targeted testing of LLM performance on tasks such as extracting specific financial metrics, performing calculations, and reasoning over financial documents.
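To make the distinction concrete, here is a minimal Python sketch of the two prompting conditions. The prompt wording and helper functions are illustrative assumptions, not FinanceBench's actual evaluation harness:

```python
# Minimal sketch of the two evaluation conditions. The prompt templates are
# assumptions for illustration, not the benchmark's official harness.

def closed_book_prompt(question: str) -> str:
    # Closed book: the model must answer from its parametric knowledge alone.
    return f"Answer the following financial question.\n\nQuestion: {question}\nAnswer:"

def open_book_prompt(question: str, evidence_excerpts: list[str]) -> str:
    # Open book: relevant excerpts from a filing (e.g., a 10-K) are supplied
    # alongside the question, and the model is asked to ground its answer in them.
    context = "\n\n".join(evidence_excerpts)
    return (
        "Use only the excerpts below from the company's public filing to answer.\n\n"
        f"Excerpts:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
```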
Challenges in Financial Question Answering
Financial questions present specific challenges for LLMs, particularly in terms of knowledge currency, domain-specificity, structured and unstructured data handling, and the need for nuanced reasoning. Financial professionals rely on highly specialized information and terminologies that general-purpose models may not have seen in their training data. Additionally, finance is a time-sensitive field where current information is essential, complicating QA for models with static pre-training data. FinanceBench evaluates LLM performance with these complexities in mind, requiring models to process both tabular and textual data, manage long texts effectively, and handle information spread across multiple documents.

Evaluating Model Performance with FinanceBench
FinanceBench tested 16 LLM configurations, including GPT-4-Turbo, Llama 2, and Claude 2, across setups such as closed book, single-document vector stores, a shared vector store, and long-context prompting. Finance professionals reviewed each model's responses to assess accuracy and flag issues like hallucinations. In closed-book scenarios, models generally performed poorly; GPT-4-Turbo, for instance, correctly answered only 9% of questions, underscoring how much access to the relevant documents matters.
More sophisticated setups, such as vector-store retrieval and long-context prompting, improved results by letting models pull in specific sections of the financial documents. GPT-4-Turbo, for instance, reached a 50% success rate when each document received its own vector store, although that approach is hard to scale in real-world applications because of latency and storage demands. Long-context configurations, used with models like Claude 2 and GPT-4-Turbo, produced the highest performance overall, but even these setups struggled with questions requiring complex calculations across multiple documents.
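To illustrate the structural difference between the per-document and shared vector-store setups, here is a rough Python sketch. The embedding function, chunking, and store API are placeholders rather than the paper's actual retrieval stack:

```python
# Rough sketch of the two retrieval configurations; the embedding function and
# the example data are placeholders, not the paper's actual components.
import numpy as np

def embed(text: str) -> np.ndarray:
    # Placeholder embedding; a real setup would use a sentence-embedding model.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(64)

class VectorStore:
    def __init__(self, chunks: list[str]):
        self.chunks = chunks
        self.vectors = np.stack([embed(c) for c in chunks])

    def retrieve(self, query: str, k: int = 3) -> list[str]:
        # Rank chunks by cosine similarity to the query and return the top k.
        q = embed(query)
        sims = self.vectors @ q / (
            np.linalg.norm(self.vectors, axis=1) * np.linalg.norm(q) + 1e-9
        )
        return [self.chunks[i] for i in np.argsort(-sims)[:k]]

# Hypothetical pre-chunked filings keyed by document ID.
filings = {
    "ACME_10K_2022": ["Total revenue was $1.2B in FY2022.", "Operating income was $240M."],
    "ACME_10K_2021": ["Total revenue was $1.0B in FY2021.", "Operating income was $180M."],
}

# Single-document setup: one store per filing, so retrieval is scoped to the
# document the question is about (more accurate, but costly to maintain at scale).
single_stores = {doc_id: VectorStore(chunks) for doc_id, chunks in filings.items()}

# Shared setup: all chunks in one store, which is cheaper to operate but can
# surface passages from the wrong company or fiscal year.
shared_store = VectorStore([c for chunks in filings.values() for c in chunks])
```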

Question Types and Model Performance Analysis
FinanceBench organizes questions into three main categories. Domain-relevant questions are applicable to any publicly traded company and cover basics like dividend payments or operating margins, with models performing best on these simpler questions. Novel generated questions, created by analysts, are more complex and company-specific, reflecting real-world needs in financial decision-making. Metrics-generated questions require precise calculations using data from financial statements; models generally performed worst on these, especially when the questions required information from multiple documents or complex mathematical reasoning.
A breakdown by question type shows that models do best on straightforward data extraction but struggle significantly with numerical reasoning and multi-document analysis, suggesting that blending different reasoning types within a single answer remains a challenge for LLMs.
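As a concrete illustration of the arithmetic involved, a metrics-generated question might ask for a company's operating margin in a given fiscal year. The figures below are made up; the point is that the model must both extract the right line items and carry out the division correctly:

```python
# Hypothetical figures for illustration only; a real question would require
# pulling these line items from the company's income statement.
revenue = 1_200_000_000          # total revenue, USD
operating_income = 240_000_000   # operating income, USD

operating_margin = operating_income / revenue
print(f"Operating margin: {operating_margin:.1%}")  # Operating margin: 20.0%
```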
Challenges and Future Directions for FinanceBench
FinanceBench highlights several obstacles to deploying LLMs effectively in financial contexts. Financial data complexity and the high accuracy expectations of the industry make even small model errors costly. The sequence in which models encounter questions and context also impacts performance, especially in cases where relevant information is embedded deep within long documents. Furthermore, models often hallucinate, producing plausible but incorrect answers, which poses a risk in high-stakes finance scenarios where users may inadvertently trust AI-generated answers.
FinanceBench's creators recommend that finance professionals rigorously evaluate AI model outputs, use multiple data sources, and validate answers for accuracy. They encourage further research into specialized retrieval techniques, the use of external tools such as calculators for numerical steps, and expanded finance-specific training data. Through ongoing improvements, FinanceBench aims to drive advances that will let AI systems meet the complex demands of the financial sector.
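One way to read the calculator recommendation is to have the model emit an arithmetic expression and delegate the actual computation to a deterministic evaluator. The sketch below is one possible interpretation of that idea; the expression format and the safe_eval helper are assumptions, not something FinanceBench ships:

```python
# Sketch of delegating arithmetic to a deterministic "calculator" tool instead
# of trusting the LLM's free-text math. The safe_eval helper is an assumption.
import ast
import operator

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}

def safe_eval(expr: str) -> float:
    """Evaluate a basic arithmetic expression without calling eval()."""
    def _eval(node: ast.AST) -> float:
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        raise ValueError("unsupported expression")
    return _eval(ast.parse(expr, mode="eval").body)

# The model extracts the figures from the filing, then hands the ratio to the tool:
print(safe_eval("240000000 / 1200000000"))  # 0.2
```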
Tags: ML News