
Evaluate your RAG pipeline using LLM as a Judge with custom dataset creation (Part 2)

Generate an evaluation dataset for financial documents and evaluate your RAG workflow built with LangGraph, Qdrant, and OpenAI.
This is Part 2 in an ongoing series on building production-grade LLM-based applications. Check out the first article, where we built the structure of an Agentic RAG over financial data using LangGraph, Qdrant, Weave, and OpenAI.
💡
Most developers skip evaluation. They build RAG systems, deploy them, and hope for the best. But here's the reality: without proper evaluation, you're flying blind. You don't know if your retrieval is finding the right documents. You don't know if your generation is staying grounded. You don't know if users are getting accurate answers. And when something breaks in production, you have no systematic way to debug it.
In this article, we will walk through building a complete evaluation workflow for RAG pipelines. We'll cover dataset generation, LLM-as-a-Judge approaches, RAG Triad metrics, and traditional ML-based evaluation methods. We'll also integrate end-to-end tracing and observability with W&B Weave, so you can debug issues and track improvements over time.
Here's what we'll be covering:
  • The fundamental problem with evaluating RAG systems
  • How to generate the dataset to evaluate your RAG workflow
  • Create the dataset for evals using RAGAS
  • LLM as a judge to evaluate RAG workflow (the RAG Triad)
  • Metrics-based evals to evaluate your workflow
  • End-to-end evals workflow to generate a dataset and evaluate the RAG pipeline
  • Conclusion

The fundamental problem with evaluating RAG systems

RAG evaluation is harder than it looks. Traditional ML evaluation assumes you have ground truth labels and can calculate accuracy. But in RAG systems, there's no single "correct" answer. In our financial data, for example, an analyst asking "What drove Netflix's subscriber growth in APAC?" could receive multiple valid responses depending on which documents were retrieved and how the information was synthesized.
Users ask questions you never anticipated. Documents get updated. The retrieval system returns different chunks in response to subtle query variations. And LLMs can generate responses that sound correct but are completely fabricated.
You need a multi-layered evaluation approach that checks:
  • Retrieval quality: Are you finding the right documents?
  • Context relevance: Is the retrieved information actually relevant to the question?
  • Answer groundedness: Is the generated response based on retrieved documents or hallucinated?
  • Answer relevance: Does the response actually address what the user asked?
  • Factual correctness: Are the facts in the response accurate?
Each of these requires different evaluation techniques. Some need LLM judges. Some need traditional metrics. Some need human review. So let's get started, step by step, to approach this problem statement.

How to generate the dataset to evaluate your RAG workflow?

Dataset generation is the step where you create synthetic questions and answers over your indexed content, across different scenarios. This is usually where evaluation pipelines break down. You cannot ask a few random questions and call it an evaluation. You need diversity, edge cases, and questions that reflect real user behavior.
Here's a systematic approach, adapted from several established best practices, for evaluation-driven development (EDD):

Step 1: Get a domain expert

This is non-negotiable. You need someone who deeply understands your data, your users, and your domain. This person provides domain-specific keywords for dataset generation and also validates the result. Their role matters because synthetic generation, no matter how sophisticated, lacks the nuanced understanding of what real users actually need.
A domain expert catches the subtle errors that automated systems miss:
  • questions that are technically grammatical but nonsensical in practice,
  • answers that are factually correct but operationally useless, or
  • edge cases that represent common failure modes in your specific domain.
They bring institutional knowledge of how users actually interact with your system, the terminology they use, and which questions indicate confusion versus expertise.
This domain expert must review not only the questions but also the expected answers, the retrieved context, and the relationships among them. If your ground truth answers are incorrect, the entire evaluation pipeline becomes worse than useless, as it will reward systems that match incorrect answers. I know this might be a tedious and time-consuming task, but you can start with 40-50 questions and gradually increase the set.

Step 2: Understand different personas and scenarios

This framework is based on evaluation-driven development principles. You need three dimensions to generate a comprehensive dataset:
  1. Features: Understanding the use case is the first step. In our case, the core use case is straightforward question answering, but the application is an Agentic RAG that also includes a tool which summarizes the user query and searches the web for results. You need a clear understanding of the use case so the generated dataset matches what the system is actually expected to do.
  2. Personas: Who are your actual users and how do they interact with your system? Most teams fail here because they create questions from a single perspective, usually their own, and miss how different users behave. A technical expert asks precise questions with domain terminology. A non-expert asks vague questions using everyday language. A student asks broad exploratory questions. A compliance officer asks focused questions about specific regulatory details. Each persona generates questions with different characteristics. When we generate the dataset, we also define different personas representing the possible users of your application.
  3. Edge Cases or Scenarios: You need coverage across simple single-hop questions where the answer sits in one piece of context, as well as multi-hop questions that require connecting information across several pieces. You also need out-of-context questions where the answer does not exist in your documents, plus ambiguous questions that can be understood in more than one way.
The combination of features × personas × scenarios creates a comprehensive dataset. Once you have the dataset, have your domain expert validate it. The question now is how to generate the dataset from a given set of documents. We will use RAGAS, an open-source toolkit for evaluating LLM-based applications.
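To make this planning step concrete, here is a minimal sketch in plain Python of how you might enumerate the feature × persona × scenario matrix before generating any questions. The lists below are illustrative placeholders, not the exact values used later in this article; the point is that every combination becomes a bucket that should contain at least a few questions.
from itertools import product

# Illustrative planning matrix: replace these with your own features, personas, and scenarios.
features = ["qa_over_filings", "web_search_summary"]
personas = ["financial_analyst", "compliance_officer", "business_student"]
scenarios = ["single_hop", "multi_hop", "out_of_context", "ambiguous"]

coverage_plan = [
    {"feature": f, "persona": p, "scenario": s}
    for f, p, s in product(features, personas, scenarios)
]

print(f"{len(coverage_plan)} buckets to cover")  # 2 * 3 * 4 = 24
print(coverage_plan[0])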

Create the dataset for evals using RAGAS

RAGAS fixes a core gap in how we evaluate LLM apps. Traditional metrics miss what actually matters in real usage, and manual checks cannot scale once the system grows. RAGAS combines LLM-driven metrics with structured experiments, giving you a steady improvement loop without guesswork. Along with the evals flow, RAGAS also helps with dataset creation, so you can later evaluate the RAG components by simply plugging in the dataset and running the checks.
Now, in addition to the features, scenarios, and personas, it is just as important to understand what data is passed to the LLM during dataset creation. Splitting on headers helps organize the content, highlight the important keyphrases, and map how they relate to each other. This is where RAGAS separates itself from simple dataset generation. Instead of asking an LLM to create questions from raw text, RAGAS builds a knowledge graph that captures the connections between concepts.
We use transforms to extract two kinds of information:
  • Headlines that identify major topics and section headers that represent core ideas.
  • Key phrases that capture important domain-specific terms and the relationships among them.
In my testing, datasets generated with knowledge graphs had significantly better diversity and realism compared to simple LLM prompting. We will apply this approach via apply_transforms in the code demo.

LLM as a judge to evaluate RAG workflow

LLM as a Judge means using a large language model to evaluate the outputs of another LLM. It works by feeding the judge LLM the question, the generated answer, and any relevant context, then asking it to check how well they match and whether the answer is valid. The key idea for LLM-as-a-Judge in RAG is to use binary scoring for relevance checks instead of numeric scales, because true or false is clear, while a score of four versus three often feels vague and unreliable when you ask an LLM to judge on a 1-5 scale.
For RAG evaluation, Jason Liu argues there are really only six evals that matter. Our main focus among these six will be the RAG Triad.
Fig.1: RAG Triad

What are RAG Triad evals?

The RAG Triad is a framework for evaluating the three critical components of retrieval-augmented generation: context relevance, answer relevance, and faithfulness/groundedness. A RAG system works with three main pieces: the question (Q), the retrieved context (C), and the answer (A).
  • Context Relevance (C|Q): Is the retrieved context relevant to the question? You're checking if your retrieval system found the right information.
  • Answer Relevance (A|Q): Is the generated answer relevant to the question? You're checking if your LLM stayed on topic and didn't hallucinate.
  • Faithfulness/Groundedness (A|C): Is the answer grounded in the provided context? You're checking if the LLM used the retrieved information rather than making things up.
A system can fail any of these independently. You might retrieve relevant context but generate an answer that ignores it. You might generate a grounded answer that doesn't address the question. You might generate a relevant answer that isn't grounded in your documents. The Triad approach catches all three failure modes. Implement it using LLM-as-a-Judge with binary scoring for each dimension, and trace and monitor the results across your entire evaluation dataset to identify systematic issues.
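To make the Triad concrete, here is a minimal sketch of the three checks as binary LLM-judge calls using the OpenAI client. The prompts, model choice, and true/false parsing are illustrative assumptions for this article, not the prompts RAGAS uses internally, and the sketch assumes OPENAI_API_KEY is already set.
from openai import OpenAI

client = OpenAI()

def binary_judge(instruction: str, payload: str, model: str = "gpt-4.1-nano") -> bool:
    # Ask the judge LLM a yes/no question and map the reply to a boolean.
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Answer strictly with 'true' or 'false'."},
            {"role": "user", "content": f"{instruction}\n\n{payload}"},
        ],
    )
    return resp.choices[0].message.content.strip().lower().startswith("true")

def rag_triad(question: str, context: str, answer: str) -> dict:
    return {
        # C|Q: did retrieval find information relevant to the question?
        "context_relevance": binary_judge(
            "Is the CONTEXT relevant to the QUESTION?",
            f"QUESTION: {question}\nCONTEXT: {context}"),
        # A|Q: does the answer address the question?
        "answer_relevance": binary_judge(
            "Does the ANSWER address the QUESTION?",
            f"QUESTION: {question}\nANSWER: {answer}"),
        # A|C: is every claim in the answer supported by the context?
        "groundedness": binary_judge(
            "Is every claim in the ANSWER supported by the CONTEXT?",
            f"CONTEXT: {context}\nANSWER: {answer}"),
    }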

Metrics-based evals to evaluate your workflow

Beyond LLM judges, traditional ML metrics provide objective measurements. These are especially valuable because they don't depend on another LLM that could introduce its own biases. Remember the confusion matrix? Yes, the thing from classical machine learning? Here, we'll make use of Precision and Recall.
  • Context Recall: Measures what percentage of relevant information was retrieved. Calculated as the fraction of reference context that appears in retrieved chunks.
  • Context Precision: Measures retrieval accuracy using ranking. Are relevant chunks ranked higher than irrelevant ones?
The precision and recall formulas work just like in classical ML, but applied to information retrieval:
  • Recall = (Relevant retrieved chunks) / (Total relevant chunks)
  • Precision = (Relevant retrieved chunks) / (Total retrieved chunks)
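As a quick worked example in plain Python (the chunk IDs are toy placeholders), here is how those two formulas play out for a single query:
# Toy example: ground-truth relevant chunks vs. what the retriever returned.
relevant = {"chunk_03", "chunk_07", "chunk_11"}
retrieved = ["chunk_03", "chunk_42", "chunk_07", "chunk_99"]

hits = [c for c in retrieved if c in relevant]

recall = len(hits) / len(relevant)      # 2 / 3 ≈ 0.67 -> one relevant chunk was missed
precision = len(hits) / len(retrieved)  # 2 / 4 = 0.50 -> half of what came back was noise

print(f"recall={recall:.2f}, precision={precision:.2f}")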
LLM judges catch semantic issues that metrics miss. Metrics provide objective baselines that aren't subject to prompt sensitivity. Once you have both in place, use Weave to log every query, every retrieval, and every generation, and to track costs, latency, and token usage. The evaluation loop never ends: generate datasets, run evaluations, identify weaknesses, improve your system, and repeat.
Let's implement this workflow using RAGAS:

End-to-end evals workflow to generate a dataset and evaluate the RAG pipeline

Recap from the Part 1 article

As a reminder, we are building evals for the Agentic RAG workflow from Part 1: Build a Financial Agentic RAG.
Fig.2: Agentic RAG workflow
Here is a quick recap of how to run a user query through the already compiled graph.
query = "Based on Netflix’s most recent 10-K filing, what were the key drivers of subscriber growth on global region"
result = graph.invoke({"question":query},config=config)

print(result['tool_used'])
print(result['response'])

Now, let's continue with dataset generation and then the evals.

Initial setup

The package versions and dependencies remain the same as in the previous article; the new addition here is RAGAS.
pip install langgraph langchain-openai==1.0.2 langchain-community
pip install weave
pip install qdrant_client fastembed pypdfium2
pip install tavily-python
pip install ragas

Step 1: Load the chunks into the Knowledge Graph

Start with your actual documents, i.e., the same ones the Agentic RAG workflow retrieves from. Load them using the exact same parsing method you use in production. For this implementation, we're using PyPDFium2Parser for PDF documents because it handles tabular content well, as mentioned in our earlier article.
from langchain_community.document_loaders import FileSystemBlobLoader
from langchain_community.document_loaders.generic import GenericLoader
from langchain_community.document_loaders.parsers import PyPDFium2Parser

from ragas.testset.graph import KnowledgeGraph
from ragas.testset.graph import Node, NodeType

loader = GenericLoader(
    blob_loader=FileSystemBlobLoader(
        path="data",
        glob="**/*.pdf",
    ),
    blob_parser=PyPDFium2Parser(),
)
documents = loader.load()
Fig.3: Raw documents to Knowledge Graph using RAGAS
Each document or chunk gets added to a knowledge graph as a node. RAGAS uses this graph structure to understand document relationships and generate questions that reflect how information is actually organized in your data. The nodes are later enriched with headlines and keyphrases to capture the relationship patterns.
kg = KnowledgeGraph()

for doc in documents:
    kg.nodes.append(
        Node(
            type=NodeType.DOCUMENT,
            properties={"page_content": doc.page_content, "document_metadata": doc.metadata}
        )
    )
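A quick optional sanity check at this point is to confirm that the graph contains one node per loaded document:
# Optional sanity check: one DOCUMENT node per loaded document.
print(f"Loaded documents: {len(documents)}")
print(f"Nodes in knowledge graph: {len(kg.nodes)}")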

Step 2: Initialize the evaluation LLM and embedding model

Next, configure your LLM and embedding models. We're using gpt-4.1-nano for question generation and OpenAI embeddings.
Get your OpenAI API Keys from here: https://platform.openai.com/
from openai import OpenAI
from langchain_openai import ChatOpenAI
from ragas.embeddings import OpenAIEmbeddings
from ragas.llms import LangchainLLMWrapper

import os
os.environ["OPENAI_API_KEY"] = "<replace-with-your-api-key>"

model_name = "gpt-4.1-nano"
eval_llm = ChatOpenAI(model=model_name)
evaluator_llm = LangchainLLMWrapper(eval_llm)

openai_client = OpenAI()
evaluator_embed = OpenAIEmbeddings(client=openai_client)

Step 3: Apply transforms with Knowledge Graph

Instead of just asking an LLM to generate questions from raw text, RAGAS builds a knowledge graph that captures relationships between concepts.
The transforms extract two types of information:
  • Headlines: Major topics and section headers that represent key concepts
  • Key phrases: Important domain-specific terms and their relationships
Why does this matter? Because it creates context-aware questions. Instead of generic "What is X?" questions, you get questions that reflect the actual structure and relationships in your documents. For financial documents, this means questions about how different metrics relate, how events impact outcomes, and how information changes over time.
You can also apply NER-based transforms, depending on your use case, if the data mentions many people, organizations, and places. Check the RAGAS docs for the NERExtractor transform.
from ragas.testset.transforms import apply_transforms
from ragas.testset.transforms import HeadlinesExtractor, HeadlineSplitter, KeyphrasesExtractor

headline_extractor = HeadlinesExtractor(llm=evaluator_llm, max_num=20)
headline_splitter = HeadlineSplitter(max_tokens=2000)
keyphrase_extractor = KeyphrasesExtractor(llm=evaluator_llm)

transforms = [
    headline_extractor,
    headline_splitter,
    keyphrase_extractor,
]
apply_transforms(kg, transforms=transforms)
Fig.4: Extract Headlines and Keyphrases.
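Once the transforms have run, you can optionally spot-check a node to confirm the properties the synthesizers will rely on ("headlines" and "keyphrases") were actually populated. Note that the splitter adds chunk nodes alongside the original document nodes, so depending on which node you pick, one of these properties may be missing:
# Optional spot-check: were headlines/keyphrases extracted onto the nodes?
sample_node = kg.nodes[0]
print(sample_node.properties.get("headlines"))
print(sample_node.properties.get("keyphrases"))
print(f"Total nodes after transforms: {len(kg.nodes)}")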

Step 4: Define different persona schema

Each persona generates questions with its own characteristics. Think about who the actual users of your application are, and define their roles and descriptions clearly.
from ragas.testset.persona import Persona

personas = [
    Persona(
        name="Investment Analyst",
        role_description="Analyzes company financials and performance metrics from SEC filings to make investment recommendations. Focuses on extracting key financial data, identifying trends, and comparing quarterly results."
    ),
    Persona(
        name="Portfolio Manager",
        role_description="Makes strategic investment decisions based on company analysis. Reviews shareholder letters for management insights and SEC filings for comprehensive financial assessment and risk evaluation."
    ),
    Persona(
        name="Risk Analyst",
        role_description="Reviews SEC filings for risk disclosures, legal issues, and compliance matters. Tracks changes in risk factors, regulatory concerns, and material events that could impact company stability."
    ),
    Persona(
        name="Individual Investor",
        role_description="Personal investor who reads company filings and shareholder letters to make informed investment decisions. Seeks clear explanations of financial performance and business strategy updates."
    ),
    Persona(
        name="Financial Reporter",
        role_description="Business journalist covering corporate earnings and market developments. Uses SEC filings and shareholder letters to verify facts, source stories, and provide accurate financial reporting."
    ),
    Persona(
        name="Business Student",
        role_description="MBA or finance student learning about corporate finance and investment analysis. Uses real company filings and shareholder letters for case studies, understanding business models, and learning how to read financial statements."
    ),
    Persona(
        name="Company Employee",
        role_description="Netflix employee interested in understanding their company's financial health, strategic direction, and competitive position. Reads shareholder letters and news to stay informed about company performance and future plans."
    )
]

Step 5: Different distribution and dataset generation

Now that the personas are set, the next step is choosing the scenarios. To do that, define the types of questions you want and their proportions, for example single-hop versus multi-hop.
Since our application is not built for multi-hop, we will use single-hop synthesizers for both the headline and keyphrase transforms used for the knowledge graph, with more focus on keyphrases (0.6 - 60%) than headlines (0.4 - 40%).
from ragas.testset.synthesizers.single_hop.specific import SingleHopSpecificQuerySynthesizer

query_distribution = [
    (SingleHopSpecificQuerySynthesizer(llm=evaluator_llm, property_name="keyphrases"), 0.6),
    (SingleHopSpecificQuerySynthesizer(llm=evaluator_llm, property_name="headlines"), 0.4),
]
Combine the LLM and embedding models with the TestsetGenerator object and start with 10-20 questions for initial validation. Have your domain expert review every single one. Once you've confirmed quality, scale to 40-50 questions.
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(
    llm=evaluator_llm,
    embedding_model=evaluator_embed,
    knowledge_graph=kg,
    persona_list=personas,
)
print("Generating testset...")
testset = generator.generate(testset_size=10, query_distribution=query_distribution)
df = testset.to_pandas()
print(df.head())
The output DataFrame has three main columns:
  • user_input: The generated question
  • reference_contexts: The chunks that should be retrieved (ground truth)
  • reference: The expected answer (ground truth)
But you're not done yet. This is your ground truth dataset. Now you need to generate predictions from your actual RAG system.
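Before adding predictions, it's worth freezing this ground-truth set somewhere your domain expert can review it; a CSV snapshot is a simple option (the filename here is arbitrary):
# Snapshot the ground-truth set for domain-expert review before adding predictions.
df.to_csv("eval_testset_v1.csv", index=False)
print(f"Wrote {len(df)} questions for review")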


Step 6: Generate model responses and retrieved contexts

Your evaluation dataset needs five columns:
  • Question: The user query
  • Reference context: What should be retrieved (ground truth)
  • Reference answer: What should be generated (ground truth)
  • Retrieved context: What your system actually retrieved
  • Generated answer: What your system actually generated
Now pass each generated user_input question to the Agentic RAG graph workflow, which was covered in Part 1 and recapped above.
answers = []
context = []

for index, row in df.iterrows():
    results = graph.invoke({"question": row["user_input"]})
    answers.append(results["response"])
    context.append(results["context"])

df['response'] = answers
df['retrieved_contexts'] = context
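One practical caveat: if any single question makes the graph raise an exception (a tool timeout, a web-search failure), the loop above aborts and you lose the whole batch. A small, optional defensive variant records failures instead:
# Optional defensive variant: record failures instead of aborting the whole batch.
answers, context = [], []
for index, row in df.iterrows():
    try:
        results = graph.invoke({"question": row["user_input"]})
        answers.append(results["response"])
        context.append(results["context"])
    except Exception as e:
        answers.append(f"ERROR: {e}")
        context.append([])

df['response'] = answers
df['retrieved_contexts'] = context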
Yes, we have finally generated the dataset for evals! Now it's time to trace and monitor each version of the dataset you generate using Weave.


Step 7: Trace and monitor your datasets using Weave

This is critical for production systems. Every dataset generation creates a new version using Weave:
Get your Wandb API key from: https://wandb.ai/authorize
Once the dataset is published, you can find the traces under the Assets section.
import weave
from weave import Dataset

os.environ['WANDB_API_KEY'] = "<replace-with-your-wandb-key>"
weave.init('evals')

trace_dataset = Dataset.from_pandas(df)
weave.publish(trace_dataset)
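If you later want to pull a published version back (for example, to re-run evals against the exact same snapshot), Weave can fetch it by reference. This is a small sketch and assumes you published the dataset under an explicit name, e.g. weave.publish(trace_dataset, name="rag-eval-dataset"); adjust the name to whatever appears in your Weave project:
# Assumes the dataset was published with an explicit name (see note above).
fetched = weave.ref("rag-eval-dataset").get()
print(fetched.rows[0])  # inspect one row of the versioned dataset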


Step 8: Evaluate the app using the dataset generated

Now that we have our evaluation dataset, we define the metrics RAGAS will use for assessment. We're combining metrics-based evaluation (context recall, precision) with LLM-as-a-Judge approaches (faithfulness, factual correctness) to get both objective measurements and semantic understanding of our RAG system's performance.
from ragas import evaluate
from ragas import EvaluationDataset
from ragas.metrics import LLMContextRecall, Faithfulness, FactualCorrectness, ContextPrecision

# retrieved_contexts and reference_contexts may be plain strings; convert them into lists.
def convert_to_list(x):
    return [x] if isinstance(x, str) else (x if isinstance(x, list) else [])

df['retrieved_contexts'] = df['retrieved_contexts'].apply(convert_to_list)
df['reference_contexts'] = df['reference_contexts'].apply(convert_to_list)

eval_dataset = EvaluationDataset.from_pandas(df)

model_name = "gpt-4.1-nano"
eval_llm = ChatOpenAI(model=model_name)
evaluator_llm = LangchainLLMWrapper(eval_llm)
Now we use Weave to trace and monitor each evaluation run by adding the @weave.op decorator to the function.
@weave.op
def run_evals(data: EvaluationDataset):
    result = evaluate(
        dataset=data,
        metrics=[
            Faithfulness(),
            FactualCorrectness(),
            LLMContextRecall(),
            ContextPrecision(),
        ],
        llm=evaluator_llm,
    )
    return result

result = run_evals(eval_dataset)
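Capturing the return value lets you look at both the aggregate scores and a per-question breakdown alongside the Weave trace; the result object from RAGAS supports to_pandas() for this:
print(result)              # aggregate metric scores
scores_df = result.to_pandas()
print(scores_df.head())    # per-question scores next to inputs, contexts, and responses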


Conclusion

That is it, folks. We have seen how to generate a dataset and run evals for Agentic RAG workflows. But the process does not end here. Evaluation is iterative. Once the scores come in, we review the Weave traces and logs to check whether the results make sense. Based on that, we try new techniques, adjust prompts, or add rerankers. If the results improve, we keep refining the workflow.
Source: Eugene Yan's blog on the eval process and eval-driven development (EDD)
