Steps for developing and evaluating a RAG application with W&B
This report walks through a hands-on RAG workflow with W&B and collects useful assets for learning more.
Overview
RAG has become one of the most popular LLM-based applications. In this report, we'll explore how to develop and evaluate a RAG application using W&B. For those interested in the details of how to construct a RAG pipeline, there's a comprehensive course available that delves deep into it; please refer to it for a detailed guide.
(FYI: the course "Training and Fine-tuning Large Language Models (LLMs)" is also available for those interested in LLMs more broadly.)
In this report, we'll look at how to do just that by evaluating our LLM-powered documentation application, Wandbot. If you have not already interacted with Wandbot, consider joining our Discord server and heading to the #wandbot channel.
💡

Wandbot in Discord
This report will follow these steps:
Step 0: First, log in to W&B.
Step 1: RAG involves several processes, so we'll start by visualizing them with a feature called Traces. Before moving on to evaluating hundreds of samples, we'll assess a few predictions to see whether they look accurate. If there are any issues, Traces can help identify the phase where the problem arises.
Step 2: Next, we'll move on to evaluating the RAG pipeline. While there are various evaluation methods, we'll specifically look into model-based evaluation here.
Step 3: Finally, we'll use Sweeps to determine which configuration looks most promising.
Step 0: W&B setup
For W&B setup, please refer to this document. If you are using W&B Dedicated Cloud, a VPC deployment, or an on-prem instance, please refer to this guide and the following figure.
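For reference, here is a minimal sketch of the login flow; the project name is a placeholder, and the host argument is only needed for Dedicated Cloud, VPC, or on-prem instances (replace the URL with your own).

```python
import wandb

# Log in to the public cloud, or pass host= for a dedicated/VPC/on-prem instance.
wandb.login()
# wandb.login(host="https://your-instance.wandb.io")  # placeholder URL

# Start a quick run to confirm the setup works; project/name are placeholders.
run = wandb.init(project="rag-handson", name="step0-setup")
run.finish()
```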

Step 1: Visualizing the intermediate process with Traces
In this notebook, you can learn the following points.
Integration with LangChain and Traces
- You can easily log results by setting os.environ["LANGCHAIN_WANDB_TRACING"] = "true" when using LangChain (official document); see the sketch after this list.
- With Traces, you can visualize not only pairs of inputs and outputs but also the intermediate processes.
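Here's a minimal sketch of enabling tracing, assuming the legacy LangChain API (module paths differ in newer releases); the model, prompt, and project name are placeholders, and any chain run after the environment variable is set gets traced the same way.

```python
import os

# Enable W&B tracing before any chain runs.
os.environ["LANGCHAIN_WANDB_TRACING"] = "true"
os.environ["WANDB_PROJECT"] = "rag-handson"  # optional; placeholder project name

from langchain.chat_models import ChatOpenAI
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate

# Placeholder chain: the same tracing applies to a full RAG chain.
llm = ChatOpenAI(temperature=0)
prompt = PromptTemplate.from_template("Answer briefly: {question}")
chain = LLMChain(llm=llm, prompt=prompt)

# Inputs, outputs, and intermediate steps are logged to W&B Traces.
print(chain.run(question="What does W&B Traces visualize?"))
```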
Traces will also be used in the Weave dashboard, where you can visualize long text more nicely. (The feature is in private beta.)
💡
Run: solar-flower-61
Docs and prompt template management with Artifacts
- You can store your data (docs, prompt templates, etc.) and manage versions with Artifacts (official document); a minimal sketch follows below.
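For illustration, here's a minimal sketch of versioning docs and a prompt template with Artifacts; the project, artifact names, and file paths are placeholders.

```python
import wandb

# Upload the docs and prompt template as versioned artifacts.
with wandb.init(project="rag-handson", job_type="upload-assets") as run:
    docs_art = wandb.Artifact("wandb-docs", type="dataset")
    docs_art.add_dir("docs/")                   # raw documentation files
    run.log_artifact(docs_art)

    prompt_art = wandb.Artifact("qa-prompt-template", type="prompt")
    prompt_art.add_file("prompt_template.txt")  # the prompt used by the bot
    run.log_artifact(prompt_art)

# Later, pull a specific version back for a reproducible experiment.
with wandb.init(project="rag-handson", job_type="build-chain") as run:
    template_dir = run.use_artifact("qa-prompt-template:latest").download()
```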

Experiment comparison
Multiple experiments can be tracked in the same dashboard, and you can easily switch which results are shown. In addition, the settings of each run can be checked from its overview page (official document).
Let's try some methods!

Step 2: LLM-based evaluation
Evaluating an LLM-based system isn't easy. It requires multiple steps and many weeks of deep thought. "How to Evaluate, Compare, and Optimize LLM Systems?" covers the whats and hows of evaluating an LLM-based system. This report won't go into detail here (you can read that piece for the full story), but broadly speaking, there are three main categories we looked at:
- Eyeballing: While building a baseline LLM system, we usually eyeball to evaluate the performance of our model. In other words: is it behaving largely the way we expect?
- Supervised: This is the recommended way to evaluate LLM apps where you involve humans to generate an annotated dataset for evaluation.
- LLMs evaluate LLMs: In this paradigm, we leverage a powerful LLM to generate proxy targets based on some context. In the case of our QA bot, we can ask an LLM to generate question-answer pairs.
We have already gone through the "eyeballing" phase of evaluation to some extent in Step 1. In Step 2, let's evaluate our RAG pipeline with an LLM.
Generate an eval dataset using an LLM
First, let's create the evaluation dataset. To actually evaluate a QA bot, we need pairs of questions and answers in our evaluation set. One feasible way of creating such a dataset is to leverage an LLM. This approach has obvious benefits and limitations:
Benefits
- It's scalable. We can generate a vast number of test cases on demand.
- It's flexible. The test cases can be generated for special edge cases and adapted to multiple domains, ensuring relevance and applicability.
- It's cheap and fast. LLMs can quickly collate information from multiple documents at far lower cost than human annotators.
Limitations
- It doesn't work in use cases where you need expert labelers (such as the medical domain).
- It doesn't reflect how actual users use the system, so it cannot deal with unexpected situations and edge cases.
Tool to generate eval dataset
LangChain has a useful chain called QAGenerationChain, which can extract pairs of questions and answers from specific document(s). We can load the document(s) using the relevant data loader (great piece by Hamel here), split them into smaller chunks, and use the chain to extract QA pairs.
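A minimal sketch of that flow is below, assuming the legacy LangChain API; the document path, chunk size, and project name are placeholders, and logging the pairs to a W&B Table at the end is an illustrative choice.

```python
import wandb
from langchain.chat_models import ChatOpenAI
from langchain.chains import QAGenerationChain
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load and chunk the source document(s); path and chunk size are placeholders.
docs = TextLoader("docs/wandb_faq.md").load()
chunks = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=0).split_documents(docs)

# Ask an LLM to propose a question/answer pair for each chunk.
chain = QAGenerationChain.from_llm(ChatOpenAI(temperature=0))
eval_set = []
for chunk in chunks:
    eval_set.extend(chain.run(chunk.page_content))  # list of {"question", "answer"} dicts

# Log the generated eval set as a W&B Table for inspection and reuse.
with wandb.init(project="rag-handson", job_type="generate-eval-set") as run:
    table = wandb.Table(columns=["question", "answer"],
                        data=[[qa["question"], qa["answer"]] for qa in eval_set])
    run.log({"eval_set": table})
```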
Run: gentle-feather-64
Evaluate your RAG with the eval dataset
Now that we have an eval set of QA pairs, we can let our LLM-based QA bot generate predictions for the questions. We can then use a metric to compare the predicted and "true" answers. Given a predicted and a "true" answer, we can literally use an LLM to judge how close the prediction is to the true answer!
LLMs are powerful because they have a good understanding of the semantics of text. Given two texts (the true and predicted answers), an LLM can, in theory, judge whether they are semantically equivalent. We can then have it assign the prediction an accuracy score from 0 to 10, where 0 is the lowest (very low similarity) and 10 is the highest (very high similarity).
Luckily, LangChain has a chain called QAEvalChain that can take in a question and "true" answer along with the predicted answer and output scores or "CORRECT"/"INCORRECT" labels for them.
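Here's a minimal sketch of that evaluation; eval_set is the question/answer list generated above, qa_chain is a hypothetical placeholder for your RAG pipeline, and the key that holds the grade differs across LangChain versions.

```python
from langchain.chat_models import ChatOpenAI
from langchain.evaluation.qa import QAEvalChain

# qa_chain is a placeholder for your RAG pipeline (e.g. a retrieval QA chain).
predictions = [{"result": qa_chain.run(ex["question"])} for ex in eval_set]

eval_chain = QAEvalChain.from_llm(ChatOpenAI(temperature=0))
graded = eval_chain.evaluate(
    eval_set,                # examples with "question" and "answer" keys
    predictions,             # model outputs under the "result" key
    question_key="question",
    answer_key="answer",
    prediction_key="result",
)

# Depending on the LangChain version, the grade is stored under "results" or "text".
grades = [g.get("results", g.get("text", "")) for g in graded]
accuracy = sum(grade.strip() == "CORRECT" for grade in grades) / len(grades)
print(f"LLM-judged accuracy: {accuracy:.2%}")
```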
Try this Google Colab! And check out the W&B Table from one such evaluation job, where an LLM was used as the metric.
Run: divine-moon-71
Step 3: Optimization
Given that we have an eval set, let's use W&B Sweeps to quickly set up a hyperparameter optimization search that will improve our metric!
For example, check which combination of methods with GPT-3.5 can achieve a score similar to the one obtained with GPT-4.
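Here's a minimal sketch of such a sweep; the swept parameters and the evaluate_rag() helper are hypothetical placeholders for however your pipeline is built and scored against the eval set.

```python
import wandb

sweep_config = {
    "method": "grid",
    "metric": {"name": "accuracy", "goal": "maximize"},
    "parameters": {
        "model_name": {"values": ["gpt-3.5-turbo", "gpt-4"]},
        "chunk_size": {"values": [500, 1000, 2000]},
        "n_retrieved_docs": {"values": [2, 4, 8]},
    },
}

def run_experiment():
    with wandb.init() as run:
        cfg = run.config
        # evaluate_rag() is a hypothetical helper that rebuilds the chain with
        # these settings and scores it against the eval set from Step 2.
        accuracy = evaluate_rag(
            model_name=cfg.model_name,
            chunk_size=cfg.chunk_size,
            n_retrieved_docs=cfg.n_retrieved_docs,
        )
        run.log({"accuracy": accuracy})

sweep_id = wandb.sweep(sweep_config, project="rag-handson")
wandb.agent(sweep_id, function=run_experiment)
```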
Sweep: 4mwvu0r0
References
Please check the following reports and courses, too.
A Gentle Introduction to LLM APIs
In this article, we dive into how large language models (LLMs) work, starting with tokenization and sampling, before exploring how to use them in your applications.
Prompt Engineering LLMs with LangChain and W&B
Join us for tips and tricks to improve your prompt engineering for LLMs. Then, stick around and find out how LangChain and W&B can make your life a whole lot easier.
How to Evaluate, Compare, and Optimize LLM Systems
This article provides an interactive look into how to go about evaluating your large language model (LLM) systems and how to approach optimizing the hyperparameters.
How to Evaluate an LLM, Part 1: Building an Evaluation Dataset for our LLM System
Building gold standard questions for evaluating our QA bot based on production data.
How to Fine-Tune an LLM Part 1: Preparing a Dataset for Instruction Tuning
Learn how to fine-tune an LLM on an instruction dataset! We'll cover how to format the data and train a model like Llama 2 or Mistral in this minimal example in (almost) pure PyTorch.
How to evaluate an LLM Part 3: LLMs evaluating LLMs
Employing auto-evaluation strategies to evaluate different components of our Wandbot RAG-based support system.

Training and Fine-tuning Large Language Models (LLMs)
Explore the architecture, training techniques, and fine-tuning methods for creating powerful LLMs. Gain theory and hands-on experience from Jonathan Frankle (MosaicML) and other industry leaders, and learn cutting-edge techniques like LoRA and RLHF.

Building LLM-Powered Apps
Learn how to build LLM-powered applications using LLM APIs, LangChain, and W&B Prompts. This course will guide you through the entire process of designing, experimenting with, and evaluating LLM-based apps.