Building and evaluating a RAG system with DSPy and W&B Weave
A guide to building a RAG system with DSPy, and evaluating it with W&B Weave.
Retrieval-augmented generation (RAG) is transforming the way language models operate by integrating external data sources to produce more informed and relevant outputs. In this tutorial, we explore how to build and optimize RAG applications using DSPy, a powerful framework designed to simplify the complexities of prompt engineering.
I will guide you through understanding both RAG and DSPy, covering their individual roles and demonstrating how to leverage them together for high-performing AI systems. Whether you want to maintain up-to-date internal documentation through chatbots or pipe live news data for stock market sentiment analysis, mastering DSPy will help streamline your RAG implementations. Additionally, we will use Weave evaluations to assess and refine the performance of the RAG pipeline, ensuring that your models deliver accurate and reliable outputs.

Table of contents
Understanding retrieval-augmented generation (RAG)
How RAG enhances language models
What is DSPy?
Core components of DSPy
Using DSPy with W&B Weave
Building a RAG pipeline with DSPy
Setting up Weaviate for your RAG pipeline
Uploading documents to Weaviate
Building out the RAG pipeline with DSPy
Compiling and executing the RAG program
Evaluating performance with Weave evaluations
Conclusion
Understanding retrieval-augmented generation (RAG)
Retrieval-augmented generation enhances the capabilities of language models by enabling them to retrieve and incorporate real-time, external information into their responses. This approach ensures outputs are not only coherent but also current, addressing the limitations of pre-trained models that rely solely on static datasets. RAG works by combining two core processes: retrieval and generation.
First, it searches external databases or repositories for relevant information; then, the retrieved content informs the generated output. This dual process helps models stay accurate and relevant even in fast-changing domains, such as financial news or customer support systems, where timely and precise information is essential.
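To make that two-step flow concrete, here is a minimal, framework-agnostic sketch. The search and llm callables are hypothetical stand-ins for a vector-store query and a language model call; any real system (including the DSPy pipeline built later in this tutorial) swaps in concrete clients:

def answer_with_rag(question: str, search, llm, k: int = 3) -> str:
    passages = search(question, top_k=k)   # 1. retrieve relevant context
    context = "\n".join(passages)          # 2. pack it into the prompt
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return llm(prompt)                     # 3. generate a grounded answer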
How RAG enhances language models
RAG significantly improves the reliability and applicability of language models. Traditional models often struggle with outdated knowledge and may generate "hallucinations," or incorrect information. RAG mitigates these issues by grounding responses in up-to-date data retrieved during inference.
For example, internal chatbots powered by RAG can offer employees the most recent policy information, or news-tracking systems can extract real-time sentiment data for stock market analysis. This makes RAG particularly valuable in dynamic settings, enhancing user trust and experience with AI systems.
What is DSPy?
DSPy is a framework that simplifies RAG pipeline implementation by abstracting prompt engineering into a structured programming approach. Developers can define task behavior using modular components, reducing the complexity of managing RAG systems while improving model consistency and efficiency.
Core components of DSPy
DSPy provides a framework for managing the flow of data from retrieval to generation. In this framework, there are a few core components, such as Signatures, Modules, and Optimizers.
Signatures allow developers to define the inputs and outputs of a task clearly, ensuring that models operate consistently. Here's an example of how you can define a Signature in code:
class GenerateAnswer(dspy.Signature):
    """Signature for generating answers based on retrieved context."""
    context = dspy.InputField(desc="Relevant facts")  # Input field for facts
    question = dspy.InputField()  # Input field for user questions
    answer = dspy.OutputField(desc="Short fact-based answer")  # Output field for the generated answer
In this example, the GenerateAnswer signature ensures that the module receives a question and context and produces a short answer as output. This structure guarantees that inputs and outputs are consistent throughout the RAG pipeline.
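For instance, once a language model has been configured via dspy.settings (shown later in this tutorial), the signature can be used directly with dspy.Predict. This is just a sketch; the context and question strings are illustrative:

generate = dspy.Predict(GenerateAnswer)
result = generate(
    context="Weave is W&B's toolkit for tracing and evaluating LLM applications.",
    question="What is Weave used for?"
)
print(result.answer)  # a short, fact-based answer as defined by the signature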
Modules encapsulate various prompting strategies, like Chain-of-Thought or ReAct, which can be swapped out depending on the use case. Optimizers fine-tune the entire system, refining the instructions and parameters for optimal performance, often with minimal data. Here's an example of a RAG module:
class RAG(dspy.Module):
    def __init__(self, num_passages=3):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=num_passages)  # Retrieve relevant content
        self.generate_answer = dspy.ChainOfThought(GenerateAnswer)  # Use the Chain-of-Thought method

    def forward(self, question: str):
        context = self.retrieve(question).passages  # Fetch top-k relevant passages
        prediction = self.generate_answer(context=context, question=question)  # Generate answer
        return dspy.Prediction(context=context, answer=prediction.answer)
This module retrieves relevant passages from a database and generates a response using the Chain-of-Thought strategy. It encapsulates the logic needed to coordinate retrieval and response generation.
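Assuming a language model and a retriever have already been registered through dspy.settings (configured later in this tutorial), using the module is a single call. The question below is illustrative:

rag = RAG(num_passages=3)
prediction = rag("How do I trace function calls with Weave?")
print(prediction.answer)   # the generated answer
print(prediction.context)  # the passages the answer was grounded in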
DSPy optimizers are designed to enhance the performance of tasks by fine-tuning various parameters within a DSPy program according to specific metrics. These optimizations focus on minimizing errors or maximizing performance, often measured through metrics like accuracy or recall. The process involves iterative adjustments, not only to the instructional prompts but also to the underlying language model parameters where possible. This multi-stage optimization may involve methods such as gradient descent for weight modifications, as well as discrete tuning of task demonstrations and prompt structures. These automated improvements streamline the creation of effective prompts, often achieving results that would be difficult to replicate manually.
A DSPy program consists of multiple modules, each handling specific logic in the task flow. Optimizers within this framework aim to optimize the parameters of the system, similar to how an optimizer in a traditional deep learning system would optimize the parameters for a model. In cases where weight modification is possible, such as with local models, the optimizer can adjust the language model’s internal weights to improve task outcomes. For models accessed through APIs or where weight tuning is not available, the focus shifts to refining instructional prompts and optimizing task-specific demonstrations. These demonstrations, acting as examples that guide the model’s responses, are generated and improved through iterative optimization, ensuring they remain aligned with the task requirements and enhance coherence across different inputs.
DSPy optimizers play a big role in refining both the program’s structure and how modules interact. Prompts and instructions are frequently adjusted to improve consistency, ensuring the model interprets inputs as intended. When demonstrations are involved, the optimizer selects or generates examples that reinforce the desired behavior, validating them against predefined metrics. In scenarios where weight tuning is feasible, the optimizer directly updates the model's internal parameters to align with the optimized prompt structures. This combination of prompt optimization, task demonstration refinement, and optional weight modification creates a flexible framework that adapts to different needs, whether working with pre-trained models, proprietary solutions, or API-based systems.
Each optimization step relies heavily on metrics to measure progress. These metrics can range from simple accuracy scores to complex logic that evaluates how well a task has been performed. Through this iterative process, DSPy optimizers ensure that the entire pipeline becomes more effective, reducing errors and improving the relevance and quality of outputs. Whether optimizing the interaction between modules, adjusting prompts, or directly fine-tuning weights, DSPy provides a robust framework for building high-performing language model applications.
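As a rough sketch of what this looks like in code, the BootstrapFewShot optimizer can compile a program against a metric. The exact-match metric and trainset below are illustrative placeholders; the full, factual-correctness-driven compilation we actually use appears later in this tutorial:

from dspy.teleprompt import BootstrapFewShot

# Illustrative metric: case-insensitive exact match against the reference answer.
def exact_match(example, pred, trace=None):
    return example.answer.lower() == pred.answer.lower()

# The optimizer bootstraps few-shot demonstrations that pass the metric
# and compiles them into the program's prompts.
optimizer = BootstrapFewShot(metric=exact_match)
optimized_rag = optimizer.compile(RAG(), trainset=trainset)  # trainset: a list of dspy.Example objects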
Using DSPy with W&B Weave
DSPy integrates seamlessly with Weave, the lightweight toolkit from Weights & Biases, allowing you to track and evaluate your language model applications with minimal setup. With this integration, all interactions within your DSPy RAG system—such as model calls, inputs, outputs, and traces—are automatically logged, providing you with a complete view of your workflows. This helps bring structure and visibility to the inherently experimental process of developing AI systems, ensuring you can debug efficiently and build robust evaluations without introducing unnecessary overhead.
To configure DSPy with Weave, you only need to import the Weave library and initialize it within your project. By calling weave.init("project_name") at the start of your DSPy program, Weave will automatically capture and log all operations within your RAG pipeline. This includes every interaction between modules, retrievals, and generations, as well as detailed traces of the inputs and outputs. This integration ensures that all stages of experimentation, evaluation, and production are organized and accessible in one place, facilitating smooth iteration and tracking of progress.
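In practice, the setup is just a couple of lines; the project name below is illustrative:

import weave

# One call at the start of the program is enough; DSPy calls made afterwards
# are traced automatically under the named project.
weave.init("dspy-rag-tutorial")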
With Weave integrated into DSPy, you gain the ability to monitor model behavior in real time, debug issues more effectively, and build rigorous evaluations by comparing the performance of different models or configurations. After importing and initializing Weave, then running your DSPy RAG pipeline, you can open Weave, and visualize the performance of your system.
Here's a sample log from one of my RAG systems:

As you can see, the entire sequence of DSPy calls is logged to Weave, allowing us to easily analyze how the system is performing!
Building a RAG pipeline with DSPy
Creating an effective RAG pipeline with DSPy requires configuring the retrieval and language models to work together cohesively. While DSPy ships with clients for many language models, it does not host a vector database of its own, so you'll need to connect to an external vector database through one of its retriever integrations or APIs.
Weaviate, a powerful vector search engine, is an excellent choice for building this infrastructure. You can configure Weaviate as your retrieval module within DSPy to fetch relevant passages for queries.
Setting up Weaviate for your RAG pipeline
To get started with Weaviate, you’ll need to create an account and set up a cluster. Weaviate offers a free 14-day trial, ideal for experimentation. Once your cluster is set up, you’ll need the cluster URL to connect your DSPy pipeline to the Weaviate instance. After creating a Weaviate account, you can navigate to "Create a Cluster" which will direct you to the screen shown below, and you can create a "Sandbox" cluster.

Uploading documents to Weaviate
For this tutorial, we will use a portion of the W&B Weave documentation as the data to be uploaded into Weaviate. Below is the code that downloads the documentation, processes it into smaller chunks, and uploads it to your Weaviate cluster using Cohere embeddings.
We’ll also configure Cohere as the embedding module within Weaviate. Cohere provides powerful multilingual embeddings, which are ideal for transforming text into dense vector representations. These embeddings allow your vector database to store and retrieve meaningful chunks of text based on semantic similarity, ensuring your RAG pipeline pulls the most relevant information.
import os
import re
import glob
import weaviate
import weaviate.classes.config as wvcc

# Function to download the Weave documentation .md files
def download_weave_docs():
    repo_dir = "weave_docs"
    if not os.path.exists(repo_dir):
        print("Downloading Weave documentation...")
        os.system(f"git clone --depth 1 --filter=blob:none --sparse https://github.com/wandb/weave.git {repo_dir}")
        os.chdir(repo_dir)
        os.system("git sparse-checkout set docs/docs/guides/tracking")
        os.chdir("..")
        print("Weave documentation downloaded successfully.")
    else:
        print(f"{repo_dir} already exists, skipping download.")

# Function to read and process the .md files
def process_md_files():
    md_files = glob.glob('weave_docs/docs/docs/guides/tracking/*.md')
    all_chunks = []
    for md_file in md_files:
        with open(md_file, 'r', encoding='utf-8') as f:
            content = f.read()
        # Clean the content if necessary
        content = clean_markdown(content)
        # Split content into chunks
        chunks = split_into_chunks(content, chunk_size=200)
        all_chunks.extend(chunks)
    return all_chunks

# Function to clean markdown content
def clean_markdown(text):
    """Remove unnecessary elements but retain code, symbols, and structure."""
    text = re.sub(r'!\[.*?\]\(.*?\)', '', text)  # Remove images
    text = re.sub(r'\[([^\]]+)\]\((.*?)\)', r'\1', text)  # Keep link text but remove URLs
    text = text.strip()
    return text

# Function to split text into chunks of specified character length
def split_into_chunks(text, chunk_size=200):
    words = text.split()
    chunks = []
    current_chunk = []
    current_length = 0
    for word in words:
        current_chunk.append(word)
        current_length += len(word) + 1  # +1 for space
        if current_length >= chunk_size:
            chunks.append(' '.join(current_chunk))
            current_chunk = []
            current_length = 0
    if current_chunk:
        chunks.append(' '.join(current_chunk))
    return chunks

# Main execution
if __name__ == "__main__":
    # Step 1: Download Weave documentation
    download_weave_docs()

    # Step 2: Process the .md files
    print("Processing .md files...")
    content_chunks = process_md_files()
    print(f"Total content chunks: {len(content_chunks)}")

    # Step 3: Connect to Weaviate instance
    client = weaviate.connect_to_wcs(
        cluster_url="your weaviate cluster url",  # Replace with your WCS URL
        auth_credentials=weaviate.auth.AuthApiKey("your weaviate auth key"),  # Replace with your WCS key
        headers={
            'X-Cohere-Api-Key': "your cohere api key"  # Replace with your Cohere API key
        }
    )

    try:
        # CAUTION: Running this will delete the collection along with its objects
        collection_name = "WeaveDocsChunk"
        if client.collections.exists(collection_name):
            client.collections.delete(collection_name)
            print(f"Existing {collection_name} collection deleted.")

        # Create the collection with the specified vectorizer configuration and properties
        collection = client.collections.create(
            name=collection_name,
            vectorizer_config=wvcc.Configure.Vectorizer.text2vec_cohere(model="embed-multilingual-v3.0"),
            properties=[
                wvcc.Property(name="content", data_type=wvcc.DataType.TEXT),
                wvcc.Property(name="source", data_type=wvcc.DataType.TEXT),
            ]
        )
        print(f"{collection_name} collection created successfully.")

        # Import objects
        collection_ref = client.collections.get(collection_name)
        for idx, content_chunk in enumerate(content_chunks):
            collection_ref.data.insert(
                properties={
                    "content": content_chunk,
                    "source": "WeaveDocs"
                }
            )
            print(f"Uploaded chunk {idx + 1}/{len(content_chunks)}")

        print("Content chunks successfully imported into Weaviate!")
    finally:
        # Close the client connection
        client.close()
        print("Client connection closed.")
In this example, we use Cohere’s text2vec embeddings to convert chunks of documentation into meaningful vector representations. These embeddings allow Weaviate to store the chunks in a way that enables efficient similarity-based retrieval. This setup ensures that your RAG pipeline retrieves the most contextually relevant information during queries.
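Before wiring the collection into DSPy, you can optionally sanity-check retrieval by querying Weaviate directly. This is a sketch that assumes the client from the upload script is still connected (for example, before the finally block closes it) and that the collection is named WeaveDocsChunk:

# Query the collection directly with a semantic search; the query text is illustrative.
chunks = client.collections.get("WeaveDocsChunk")
response = chunks.query.near_text(query="How do I trace a function with Weave?", limit=3)
for obj in response.objects:
    print(obj.properties["content"][:120], "...")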
By configuring DSPy with this retrieval model, your pipeline will leverage both Weaviate and Cohere to dynamically fetch relevant documents. With the documents uploaded, the next step involves setting up the DSPy settings to connect the language model and retriever seamlessly. This powerful combination ensures that your pipeline delivers accurate, context-aware responses based on the latest data.
Building out the RAG pipeline with DSPy
Now that we have created a vector database with Weaviate, we are ready to move on to building out a RAG pipeline with DSPy.
To start, we configure a retrieval model that integrates with Weaviate's API. DSPy offers multiple out-of-the-box retrieval modules, such as ColBERTv2, Pinecone, and AzureCognitiveSearch, and also lets you write custom retrieval clients. These clients handle queries and return the top-k passages relevant to each one, making them essential components in any RAG pipeline. Here's how to configure a Weaviate-backed retrieval model.
First, we begin by initializing a Weaviate client and configuring it to connect with your Weaviate cloud instance. This client will enable your RAG system to retrieve relevant data from a specified collection:
import dspy
import weaviate
from dspy.retrieve.weaviate_rm import WeaviateRM
import weave; weave.init("dspy-inference")

# Connect to Weaviate cloud instance
client = weaviate.connect_to_wcs(
    cluster_url="your-cluster-url",  # Replace with your WCS URL
    auth_credentials=weaviate.auth.AuthApiKey("your-api-key"),  # Replace with your WCS API key
    headers={'X-Cohere-Api-Key': "your-api-key"}  # Replace with your Cohere API key
)

# Create a WeaviateRM client for retrieval
retriever_model = WeaviateRM(
    weaviate_collection_name="WeaveDocsChunk",
    weaviate_client=client,
    k=5
)
Here, the WeaviateRM module acts as the retriever, accessing your vector database to fetch the top-k passages for any given query. This module, along with the language model, forms the backbone of your RAG pipeline. Now let's configure the language model, using GPT-4o mini as an example:
# Configure the language model
llm = dspy.OpenAI(
    model='gpt-4o-mini',
    api_key="your-api-key"
)
With both the language model and the retrieval model in place, we link them together through DSPy settings. This configuration ensures the retriever and generator work in sync, with the retriever fetching relevant context and the language model generating the final answer:
# Configure DSPy settings
dspy.settings.configure(lm=llm, rm=retriever_model)
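At this point you can quickly confirm retrieval is wired up end to end; the query text below is illustrative:

# dspy.Retrieve uses the retriever registered in dspy.settings
topk = dspy.Retrieve(k=3)("How do I log traces with Weave?")
for passage in topk.passages:
    print(passage[:100], "...")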
Now, let's define the signatures and the RAG module to utilize our retrieval and language model setup effectively:
class GenerateAnswer(dspy.Signature):
    """Answer questions with short factoid answers."""
    context = dspy.InputField(desc="Relevant facts")
    question = dspy.InputField()
    answer = dspy.OutputField(desc="Short answer")


class RAG(dspy.Module):
    def __init__(self, num_passages=3):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=num_passages)
        self.generate_answer = dspy.ChainOfThought(GenerateAnswer)

    def forward(self, question: str):
        # Retrieve relevant passages
        context = self.retrieve(question).passages
        # Generate an answer using the retrieved context
        prediction = self.generate_answer(context=context, question=question)
        return dspy.Prediction(context=context, answer=prediction.answer)
This setup defines the GenerateAnswer signature, which specifies the fields involved in generating an answer from retrieved context. The RAG module orchestrates the flow from retrieving data with the Weaviate retriever to generating answers with the configured GPT-4o mini model. This structure allows your RAG pipeline to perform full-cycle operations, from fetching relevant information to delivering precise answers.
By combining Weaviate as a retriever with a language model like GPT-4 mini, you build a RAG pipeline capable of retrieving relevant information and generating accurate, context-aware responses. DSPy’s flexible configuration allows you to swap retrievers or language models easily, giving you the ability to experiment with different setups without reworking the core logic.
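For example, you can run the (not yet compiled) pipeline and then inspect the exact prompt DSPy generated for the last language model call. The question is illustrative, and inspect_history may differ slightly across DSPy versions:

# Run the pipeline end to end
rag = RAG(num_passages=3)
pred = rag("How do I initialize Weave in my project?")
print(pred.answer)

# Show the prompt and completion from the most recent LM call
llm.inspect_history(n=1)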
Compiling and executing the RAG program
We will now walk through the process of compiling a Retrieval-Augmented Generation (RAG) pipeline using DSPy. To ensure the optimizer can effectively improve the program’s performance, we need both a training set and a test set. The training set helps the optimizer fine-tune the parameters, while the test set evaluates how well the system generalizes to unseen data. For this tutorial, we split a dataset into training and evaluation sets, with the formatted examples guiding the optimizer. Below is an outline of the full setup, which will help you build and compile the RAG pipeline efficiently.
The dataset is split using train_test_split into a training set of 20 examples and a test set of 10 examples. Each example contains a question and a reference answer. This structured format ensures the optimizer can evaluate predictions against the ground truth effectively. We also define the GenerateAnswer signature, which outlines the expected input-output structure, such as the input question and context, and the expected short answer. This guides the RAG module to retrieve relevant content and generate coherent responses.
The RAG pipeline is built by defining the RAG module, which incorporates the retrieval and generation logic. The retrieval component fetches the most relevant passages based on the input question, and the generation component produces an answer using the retrieved context. To ensure the pipeline is optimized, we introduce a custom metric for factual correctness. This metric compares the generated response against the reference answer to determine its factual accuracy, ensuring the model outputs reliable information.
Here's the code for compiling our optimizer:
# Step 1: Import necessary libraries and modules
import dspy
from dspy.teleprompt import BootstrapFewShot
from dspy.evaluate.evaluate import Evaluate
from sklearn.model_selection import train_test_split
import pandas as pd
import random
import weave
import weaviate
from dspy.retrieve.weaviate_rm import WeaviateRM
from dspy.primitives.example import Example

# Import Ragas metric for factual correctness
from ragas.llms import LangchainLLMWrapper
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics._factual_correctness import FactualCorrectness
from langchain_openai import ChatOpenAI
import asyncio
import nest_asyncio

# Apply nest_asyncio to avoid event loop conflicts in Weave
nest_asyncio.apply()

weave.init(project_name="dspy")

SEED = 42
random.seed(SEED)
dataset_path = './generated_testset.csv'

# Step 2: Configure the Language Model (LM) and Retrieval Model (RM)
llm = dspy.OpenAI(model='gpt-4o-mini', api_key="your openai api key")

client = weaviate.connect_to_wcs(
    cluster_url="",  # Replace with your WCS URL
    auth_credentials=weaviate.auth.AuthApiKey(""),  # Replace with your WCS key
    headers={
        'X-Cohere-Api-Key': ""  # Replace with your Cohere API key
    }
)

retriever_model = WeaviateRM(
    weaviate_collection_name="WeaveDocsChunk",  # The collection created earlier
    weaviate_client=client,
    k=5  # Number of top results to retrieve
)

dspy.settings.configure(lm=llm, rm=retriever_model)

# Initialize the evaluator LLM using Langchain and the OpenAI GPT-4o mini model
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini", openai_api_key="your openai api key"))

# Step 3: Load and Format Your Custom Dataset
df = pd.read_csv(dataset_path)

def format_dataset(df):
    """Format the dataset into DSPy-compatible examples with initialized inputs."""
    examples = []
    for _, row in df.iterrows():
        example = Example({
            'question': row['user_input'],
            'answer': row['reference']
        }).with_inputs('question')  # Input key is 'question'
        examples.append(example)
    return examples

# Split the dataset into training and evaluation sets
train_df, eval_df = train_test_split(df, train_size=20, test_size=10, random_state=SEED)
trainset = format_dataset(train_df)
devset = format_dataset(eval_df)

print(f"Trainset Size: {len(trainset)}, Devset Size: {len(devset)}")
print(f"First Trainset Example: {trainset[0]}")

# Step 4: Define the Signatures for the RAG Pipeline
class GenerateAnswer(dspy.Signature):
    """Answer questions with 1-3 sentence answers."""
    context = dspy.InputField(desc="may contain relevant facts")
    question = dspy.InputField()
    answer = dspy.OutputField(desc="Answer in 1-3 sentences")

# Step 5: Build the RAG Pipeline
class RAG(dspy.Module):
    def __init__(self, num_passages=5):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=num_passages)
        self.generate_answer = dspy.ChainOfThought(GenerateAnswer)

    def forward(self, question):
        context = self.retrieve(question).passages
        prediction = self.generate_answer(context=context, question=question)
        return dspy.Prediction(context=context, answer=prediction.answer)

# Step 6: Define the Factual Correctness Metric
def factual_correctness_metric(example, pred, trace=None):
    """Use the Ragas factual correctness metric."""
    response = pred.answer      # The model-generated answer
    reference = example.answer  # The ground truth reference

    # Create the sample in the format Ragas expects
    sample = SingleTurnSample(response=response, reference=reference)

    # Initialize and calculate the factual correctness score
    factual_correctness = FactualCorrectness(llm=evaluator_llm)
    loop = asyncio.get_event_loop()
    score = loop.run_until_complete(factual_correctness.single_turn_ascore(sample=sample))
    print("#" * 20); print(score); print("#" * 20)

    # Return a numerical score (0-1 range) for optimization
    return score

# Step 7: Compile the RAG Program with the Custom Metric and Teleprompter
teleprompter = BootstrapFewShot(
    metric=factual_correctness_metric,  # Use the factual correctness metric
    metric_threshold=0.7  # Accept demonstrations with scores >= 0.7
)
compiled_rag = teleprompter.compile(RAG(), trainset=trainset)

# Optional: Save the compiled RAG program
save_path = './compiled_rag_program_v1.json'
compiled_rag.save(save_path)
print(f"RAG Program compiled and saved to {save_path}.")
The compilation process leverages the BootstrapFewShot optimizer, which uses the factual correctness metric to guide optimization. The optimizer runs iterative evaluations, accepting only demonstrations whose score meets the 0.7 threshold, refining the pipeline for optimal performance. After compilation, the RAG program is saved for future use, so the system can be further refined or deployed without re-running the optimization from scratch.
With this setup complete, the RAG pipeline is ready to generate accurate and context-aware responses. The combination of carefully structured signatures, optimized retrieval and generation modules, and a reliable evaluation metric ensures the pipeline performs effectively. The compiled version is saved as compiled_rag_program_v1.json, allowing for easy re-deployment and further optimization in future iterations.
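Reloading the compiled program in a later session looks like the following sketch, assuming the save path above and a language model and retriever already configured via dspy.settings:

# Load the compiled prompts and demonstrations into a fresh RAG instance
rag = RAG()
rag.load(path='./compiled_rag_program_v1.json')
print(rag("How do I evaluate a model with Weave?").answer)  # question is illustrative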
Evaluating performance with Weave evaluations
In this section, we evaluate the performance of our RAG pipeline using Weave Evaluations. This allows us to measure the effectiveness of our pipeline based on key metrics such as factual correctness. We'll use a test dataset to compare expected outputs with those generated by the RAG system, ensuring the pipeline behaves as expected and delivers high-quality responses.
Below is the complete code to set up and execute the evaluation process. This includes loading the dataset, formatting it for evaluation, and defining the necessary evaluation functions using Weave and RAGAS metrics.
import os
import asyncio
import pandas as pd
import random
import weave
from weave import Evaluation, Model

# Import your RAG pipeline components
from dspy.primitives.example import Example
import dspy
from dspy.teleprompt import BootstrapFewShot
from dspy.evaluate.evaluate import Evaluate
from sklearn.model_selection import train_test_split
import weaviate
from dspy.retrieve.weaviate_rm import WeaviateRM

# Import RAGAS metrics and related classes
from ragas.llms import LangchainLLMWrapper
from langchain.chat_models import ChatOpenAI
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import FactualCorrectness
from weave.trace.box import unbox
import nest_asyncio
import math

weave.init("ragas_eval")

# Seed for reproducibility
SEED = 42
random.seed(SEED)

# Apply nest_asyncio to avoid event loop conflicts in Weave
nest_asyncio.apply()

# Configure Language Model (LM) and Retriever Model (RM)
turbo = dspy.OpenAI(model='gpt-4o-mini', api_key="your api key")

client = weaviate.connect_to_wcs(
    cluster_url="your cluster url",
    auth_credentials=weaviate.auth.AuthApiKey("your weaviate api key"),
    headers={'X-Cohere-Api-Key': "your cohere api key"}
)

retriever_model = WeaviateRM(
    weaviate_collection_name="WeaveDocsChunk",
    weaviate_client=client,
    k=5
)

# Configure DSPy settings
dspy.settings.configure(lm=turbo, rm=retriever_model)


class GenerateAnswer(dspy.Signature):
    """Answer questions with short factoid answers."""
    context = dspy.InputField(desc="may contain relevant facts")
    question = dspy.InputField()
    answer = dspy.OutputField(desc="Answer in 1-3 sentences")


class RAG(dspy.Module):
    def __init__(self, num_passages=5):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=num_passages)
        self.generate_answer = dspy.ChainOfThought(GenerateAnswer)

    def forward(self, question: str):
        context = self.retrieve(question).passages
        if isinstance(question, weave.trace.box.BoxedStr):
            question = unbox(question)
        prediction = self.generate_answer(context=context, question=question)
        return dspy.Prediction(context=context, answer=prediction.answer)


class RAGModel(Model):
    model_name: str
    rag_pipeline: RAG

    @weave.op
    def predict(self, question: str):
        pred = self.rag_pipeline(question)
        return {
            "output": pred.answer,
            "input": question,
            "retrieved_contexts": pred.context
        }


def safe_score(score):
    """Helper function to return 0 if the score is NaN or None."""
    if score is None or (isinstance(score, float) and math.isnan(score)):
        return 0
    return score


@weave.op
def ragas_factual_correctness_score(expected: str, question: str, model_output: dict) -> dict:
    if not model_output:
        print("Model output is None.")
        return {'factual_correctness_score': 0}

    sample = SingleTurnSample(
        response=model_output.get('output', "No output."),
        reference=expected
    )
    evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))
    factual_correctness_metric = FactualCorrectness(llm=evaluator_llm)
    loop = asyncio.get_event_loop()
    score = loop.run_until_complete(factual_correctness_metric.single_turn_ascore(sample=sample))
    return {'factual_correctness_score': safe_score(score)}


def format_dataset(df):
    """Format the dataset into a list of dictionaries for Weave evaluation."""
    evaluation_dataset = []
    for _, row in df.iterrows():
        entry = {
            "question": row['user_input'],
            "expected": row['reference']
        }
        evaluation_dataset.append(entry)
    return evaluation_dataset


if __name__ == "__main__":
    dataset_path = './generated_testset.csv'
    df = pd.read_csv(dataset_path)
    eval_data = format_dataset(df)

    save_path = './compiled_rag_program_v1.json'
    rag_pipeline = RAG()
    rag_pipeline.load(path=save_path)

    rag_model = RAGModel(model_name='RAG Model', rag_pipeline=rag_pipeline)

    evaluation = Evaluation(dataset=eval_data, scorers=[ragas_factual_correctness_score])
    print("Starting Weave evaluation...")
    asyncio.run(evaluation.evaluate(rag_model))
The pipeline setup defines the retrieval and generation flow through the RAG class, with the RAGModel class serving as the entry point for predictions. The format_dataset function converts the test dataset into a structured format required for evaluation. For evaluation logic, we utilize a factual correctness metric from the RAGAS library, which compares the output of the model with the expected answer to ensure factual alignment. We initialize the Evaluation object with the formatted dataset and scorer, then run the evaluation asynchronously to assess the model’s performance.
After running this script, you can visualize the results in the Weave evaluations dashboard, which shows the factual correctness score for each example alongside the full traces of the pipeline.
By selecting multiple evaluation runs in your Weights & Biases project and clicking the 'Compare' button, you can generate side-by-side visualizations of different models or pipeline configurations. This comparison helps you identify where the pipeline performs well and where further optimization is needed.
Here's a screenshot of what it looks like inside Weave after running your evaluation.

The results from this evaluation will indicate how well the RAG pipeline performs in terms of factual correctness. If needed, you can enhance this process by introducing additional metrics such as faithfulness or relevance, based on your specific use case.
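As a sketch of how an extra metric could slot in, here is a hypothetical faithfulness scorer built on RAGAS, reusing the SingleTurnSample and safe_score helpers from the script above (the exact RAGAS API may vary by version):

from ragas.metrics import Faithfulness

@weave.op
def ragas_faithfulness_score(question: str, model_output: dict) -> dict:
    """Additional scorer: does the answer stay faithful to the retrieved context?"""
    sample = SingleTurnSample(
        user_input=question,
        response=model_output.get('output', ""),
        retrieved_contexts=list(model_output.get('retrieved_contexts', []))
    )
    metric = Faithfulness(llm=LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini")))
    loop = asyncio.get_event_loop()
    score = loop.run_until_complete(metric.single_turn_ascore(sample=sample))
    return {'faithfulness_score': safe_score(score)}

# Pass both scorers to the evaluation:
# evaluation = Evaluation(dataset=eval_data, scorers=[ragas_factual_correctness_score, ragas_faithfulness_score])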
Conclusion
This tutorial demonstrated how RAG improves language models by incorporating external data sources, ensuring responses are relevant, accurate, and up-to-date. We explored how DSPy simplifies the creation and optimization of RAG pipelines by offering a structured approach to prompt engineering, making it easier to build, maintain, and refine complex AI workflows. Additionally, we integrated Weave evaluations to monitor and assess the performance of the pipeline, ensuring that the system generates reliable outputs and minimizes errors like outdated information or hallucinations.
By combining DSPy's flexibility with Weave's evaluation capabilities, you can efficiently build and fine-tune high-performing RAG systems. Whether you need to automate documentation through chatbots or extract real-time sentiment from financial data, mastering these tools ensures your AI solutions are both functional and reliable, capable of adapting to the evolving needs of your business.