
Steps for developing and evaluating a RAG application with W&B

This report walks through a hands-on RAG workflow with W&B and collects useful assets for learning more.

Overview

RAG has become one of the most popular LLM-based applications. In this report, we'll explore how to develop and evaluate a RAG application using W&B. For those interested in the details of how to construct a RAG pipeline, there's a comprehensive training course available that delves deep into it. Please refer to it for a detailed guide.
(FYI: a related course, "Training and Fine-tuning Large Language Models (LLMs)", is also available.)
In this report, we'll look at how to do just that by evaluating our LLM-powered documentation application, which we call Wandbot. If you have not already interacted with wandbot, consider joining our Discord server and heading to the #wandbot channel.

Wandbot in Discord

This report follows these steps:
Step 0: First, log in to W&B.
Step 1: RAG involves several processing stages. We'll first visualize them using a feature called Traces. Before moving on to evaluating hundreds of samples, let's check a few to see whether the predictions look accurate. If there are any issues, Traces helps identify at which stage the problem arises.
Step 2: Next, we'll move on to evaluating the RAG pipeline. While there are various evaluation methods, we'll specifically look at model-based (LLM-based) evaluation here.
Step 3: Finally, using Sweeps, we'll determine which configuration looks most promising.


Step 0: W&B setup

For the W&B setup, please refer to this document. If you are using W&B Dedicated Cloud, a VPC deployment, or an on-prem installation, please refer to this and the figure below.
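As a minimal sketch (the project name below is a placeholder), authenticating and starting a run typically looks like this:

    # Install and authenticate once per environment:
    #   pip install wandb
    #   wandb login          # or set the WANDB_API_KEY environment variable

    import os
    import wandb

    # For Dedicated Cloud / VPC / on-prem deployments, point the SDK at your own
    # host first (the URL below is a placeholder for your deployment):
    # os.environ["WANDB_BASE_URL"] = "https://wandb.your-company.example"

    run = wandb.init(project="rag-handson", job_type="dev")  # placeholder project name
    # ... your RAG code goes here ...
    run.finish()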


Step 1: Visualizing the intermediate process with Traces

Try this Google Colab!
This notebook covers the following points:
Integration with LangChain and Traces
  • You can easily log results by setting os.environ["LANGCHAIN_WANDB_TRACING"] = "true" when using LangChain (official document); see the sketch below.
  • With traces, you can visualize not only pairs of input and output but also the intermediate processes.
Traces will also be available in the Weave dashboard, where long text is displayed more readably. (This feature is in private beta.)
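To make the integration concrete, here is a minimal sketch of enabling W&B tracing for a simple LangChain chain. The project name, model, and prompt are illustrative, and the exact import paths depend on your LangChain version:

    import os

    # Turning this on makes LangChain log every chain run to W&B Traces
    os.environ["LANGCHAIN_WANDB_TRACING"] = "true"
    os.environ["WANDB_PROJECT"] = "rag-handson"  # placeholder project name

    from langchain.chat_models import ChatOpenAI
    from langchain.chains import LLMChain
    from langchain.prompts import PromptTemplate

    prompt = PromptTemplate.from_template(
        "Answer the question about W&B as concisely as possible.\nQuestion: {question}"
    )
    chain = LLMChain(llm=ChatOpenAI(temperature=0), prompt=prompt)

    # Inputs, outputs, and the intermediate LLM call all show up in Traces
    chain.run(question="How do I log a confusion matrix with wandb?")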


Run: solar-flower-61



Docs and prompt template management with Artifacts
  • You can store your data (docs, prompt templates, etc.) and manage versions with Artifacts (official document); see the sketch below.
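As an illustration, versioning a docs folder and a prompt template as one Artifact might look like this sketch (project name and file paths are placeholders):

    import wandb

    run = wandb.init(project="rag-handson", job_type="upload")  # placeholder project name

    # Bundle the source documents and the prompt template into a single versioned artifact
    artifact = wandb.Artifact(name="rag-assets", type="dataset")
    artifact.add_dir("docs/")                  # placeholder path to the documentation files
    artifact.add_file("prompt_template.txt")   # placeholder path to the prompt template
    run.log_artifact(artifact)
    run.finish()

    # Later, any run can pull a specific version back down:
    # artifact = run.use_artifact("rag-assets:latest")
    # local_dir = artifact.download()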


Experiment comparison
Multiple experiments can be tracked in the same dashboard, and you can easily switch which results are shown. Each run's settings can also be checked from its overview page (official document); a minimal sketch follows below.
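For the settings to appear on each run's overview page (and in the runs table for comparison), pass them as config when the run starts. A minimal sketch, with illustrative hyperparameter names and values:

    import wandb

    config = {
        "chunk_size": 500,                            # illustrative values
        "embedding_model": "text-embedding-ada-002",
        "llm": "gpt-3.5-turbo",
        "top_k": 3,
    }

    run = wandb.init(project="rag-handson", config=config)  # placeholder project name
    # ... run the RAG pipeline with these settings ...
    run.log({"eval_score": 7.5})                      # illustrative metric
    run.finish()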
Let's try some methods!




Step 2: LLM-based evaluation

Evaluating an LLM-based system isn't easy. It requires multiple steps and many weeks of deep thought. "How to Evaluate, Compare, and Optimize LLM Systems?" covers the whats and hows of evaluating an LLM-based system. This report won't go into detail here (you can read that piece for more), but broadly speaking, there are three main categories we looked at:
  • Eyeballing: While building a baseline LLM system, we usually eyeball the outputs to evaluate the performance of our model. In other words: is it behaving largely the way we expect?
  • Supervised: This is the recommended way to evaluate LLM apps where you involve humans to generate an annotated dataset for evaluation.
  • LLMs evaluate LLMs: In this paradigm, we leverage a powerful LLM to generate proxy targets based on some context. In the case of our QA bot, we can ask an LLM to generate question-answer pairs.
We already went through the "eyeballing" phase of evaluation to some extent in Step 1. In Step 2, let's evaluate our RAG pipeline with an LLM.

Generate an eval dataset using an LLM

First, let's create the evaluation dataset. We need pairs of questions and answers in our evaluation set to actually evaluate a QA bot. One feasible way of creating such a dataset is to leverage an LLM. This approach has obvious benefits and limitations:
Benefits
  • It's scalable. We can generate a vast number of test cases on demand.
  • It's flexible. The test cases can be generated for special edge cases and adapted to multiple domains, ensuring relevance and applicability.
  • It's cheap and fast. LLMs can quickly collate information from multiple documents at a far lower cost.
Limitations
  • It doesn't work in use cases where you need expert labelers (such as the medical domain).
  • It doesn't reflect performance on real user queries: this method can't cover unexpected situations and edge cases.
Tool to generate eval dataset
LangChain has a useful chain called QAGenerationChain, which can extract pairs of questions and answers from specific document(s). We can load the document(s) using the relevant data loader (great piece by Hamel here), split them into smaller chunks, and use the chain to extract QA pairs.
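A minimal sketch of that flow, assuming a plain-text loader and an illustrative file path and chunk size (the exact LangChain import paths depend on your version):

    from langchain.chat_models import ChatOpenAI
    from langchain.document_loaders import TextLoader
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    from langchain.chains import QAGenerationChain

    # Load and chunk the source document (the path is a placeholder)
    docs = TextLoader("wandb_docs.md").load()
    chunks = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=0).split_documents(docs)

    # Ask an LLM to draft question/answer pairs from each chunk
    chain = QAGenerationChain.from_llm(ChatOpenAI(temperature=0))
    eval_set = []
    for chunk in chunks:
        # run() returns a list of {"question": ..., "answer": ...} dicts
        eval_set.extend(chain.run(chunk.page_content))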

Try this Google Colab!


Run: gentle-feather-64




Evaluate your RAG with the evaluation dataset

Now that we have an eval set of QA pairs, we can let our LLM-based QA bot generate predictions for the questions. We can then use a metric to compare the predicted and "true" answers. Given a predicted and a "true" answer, we can literally use an LLM to judge how close the prediction is to the true answer!
LLMs are powerful because they now have a good understanding of the semantics of text. Given two texts (the true and predicted answers), an LLM can, in theory, determine whether they are semantically equivalent. We can then have it assign the prediction an accuracy score from 0 to 10, where 0 is the lowest (very low similarity) and 10 is the highest (very high similarity).
Luckily, LangChain has a chain called QAEvalChain that can take in a question and the "true" answer along with the predicted answer, and output scores or "CORRECT"/"INCORRECT" labels for them. A sketch of this step follows below.
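The sketch below reuses `eval_set` from the previous section; `predict` is a hypothetical stand-in for your RAG pipeline, and the output key of QAEvalChain can differ slightly across LangChain versions:

    import wandb
    from langchain.chat_models import ChatOpenAI
    from langchain.evaluation.qa import QAEvalChain

    # `predict` is a hypothetical function wrapping your RAG pipeline
    predictions = [{"result": predict(ex["question"])} for ex in eval_set]

    eval_chain = QAEvalChain.from_llm(ChatOpenAI(temperature=0))
    graded = eval_chain.evaluate(
        eval_set,
        predictions,
        question_key="question",
        answer_key="answer",
        prediction_key="result",
    )

    # Log everything to a W&B Table for inspection
    run = wandb.init(project="rag-handson", job_type="eval")  # placeholder project name
    table = wandb.Table(columns=["question", "true_answer", "predicted_answer", "grade"])
    for ex, pred, g in zip(eval_set, predictions, graded):
        label = g.get("results") or g.get("text")  # key name varies by LangChain version
        table.add_data(ex["question"], ex["answer"], pred["result"], label)
    run.log({"eval_table": table})
    run.finish()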
Try this Google Colab! And check out the W&B Table from one such evaluation job, where an LLM was used as the metric.


Run: divine-moon-71




Step 3: Optimization

Given that we have an eval set, let's use W&B Sweeps to quickly set up a hyperparameter search that improves our metric!
Try this Google Colab!
Check, for example, which combination of methods with gpt-3.5 can achieve a score similar to that of gpt-4. A sketch of a sweep configuration follows below.
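As a sketch, a sweep over a few illustrative RAG settings could be defined directly in Python. The metric and parameter names are assumptions and must match whatever your evaluation function reads from `wandb.config` and logs:

    import wandb

    sweep_config = {
        "method": "grid",
        "metric": {"name": "eval_score", "goal": "maximize"},  # illustrative metric name
        "parameters": {
            "llm": {"values": ["gpt-3.5-turbo", "gpt-4"]},
            "chunk_size": {"values": [500, 1000, 2000]},
            "top_k": {"values": [1, 3, 5]},
        },
    }

    def run_eval():
        run = wandb.init()  # each agent call picks up one parameter combination
        cfg = wandb.config
        # ... build the RAG pipeline with cfg.llm / cfg.chunk_size / cfg.top_k,
        #     run it over the eval set, and compute a score ...
        score = 0.0         # placeholder for the real evaluation result
        run.log({"eval_score": score})
        run.finish()

    sweep_id = wandb.sweep(sweep_config, project="rag-handson")  # placeholder project name
    wandb.agent(sweep_id, function=run_eval)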

Sweep: 4mwvu0r0





Reference

Please check out the following reports and course, too.