How to optimize LLM workflows using DSPy and W&B Weave
Learn how to use DSPy teleprompters and Weave to automatically optimize prompting strategies for causal reasoning
🎬 Introduction
In this report, we'll show you how to improve the performance of an LLM workflow built for the causal judgment task from the BIG-Bench Hard benchmark, and how to evaluate different prompting strategies. We'll use DSPy to implement the LLM workflow and optimize our prompting strategy, and Weave, W&B's lightweight toolkit for developing LLM applications, to track the workflow and evaluate our prompting strategies.
You might not be familiar with BIG-bench, so a quick word on that: it stands for Beyond the Imitation Game Benchmark and is a collaborative benchmark of more than 200 tasks intended to probe large language models and extrapolate their future capabilities. BIG-Bench Hard (BBH) is a suite of 23 of the most challenging BIG-Bench tasks, ones that the current generation of language models still struggles to solve.
You can follow along with this tutorial in the accompanying Google Colab if you'd like to try any of this yourself:

Table of contents
🎬 Introduction
Table of contents
⚙️ Tools we need for this project
🧶 Enable tracking using Weave
💽 Load the BIG-Bench Hard dataset
🤖 Implementing the LLM workflow as a DSPy program
✍️ Writing a baseline causal reasoning program
⚖️ Evaluating the baseline DSPy program
🦾 Optimizing our DSPy program
🏁 Conclusion
📕 Further Resources
⚙️ Tools we need for this project
Installing the libraries and getting the OpenAI API Key
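The original setup panel isn't reproduced above, so here's a minimal sketch of what it needs to do, assuming the dspy-ai, weave, and datasets packages are the required dependencies:

# In Colab, install the dependencies first (package names are assumptions):
# !pip install -qU dspy-ai weave datasets

import os
from getpass import getpass

# Prompt for the OpenAI API key without echoing it, and expose it to the OpenAI client
os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")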
🧶 Enable tracking using Weave
Enable tracking and metadata management using Weave
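The panel contents aren't reproduced above, so here's a minimal sketch: weave.init starts tracing every weave.op in a named project, and a small Metadata object carries the experiment configuration. The field names and default values below are assumptions inferred from how metadata is used in get_dataset further down.

import weave
from pydantic import BaseModel

# Start Weave tracking; the project name is illustrative
weave.init(project_name="dspy-bigbench-hard")


class Metadata(BaseModel):
    # Hypothetical experiment configuration; fields inferred from get_dataset below
    dataset_address: str = "maveriq/bigbenchhard"  # assumed Hugging Face dataset path
    big_bench_hard_task: str = "causal_judgement"  # assumed BBH task/config name
    num_train_examples: int = 50                   # assumed train/validation split point


metadata = Metadata()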
💽 Load the BIG-Bench Hard dataset
We'll load this dataset from HuggingFace Hub, split it into training and validation sets, and publish them on Weave. This lets us version the datasets and use the Weave Evaluation API to evaluate our prompting strategy.
import dspy
from datasets import load_dataset


@weave.op()
def get_dataset(metadata: Metadata):
    # load the BIG-Bench Hard dataset corresponding to the task from the Hugging Face Hub
    dataset = load_dataset(metadata.dataset_address, metadata.big_bench_hard_task)["train"]

    # create the training and validation datasets
    rows = [{"question": data["input"], "answer": data["target"]} for data in dataset]
    train_rows = rows[0:metadata.num_train_examples]
    val_rows = rows[metadata.num_train_examples:]

    # create the training and validation examples consisting of `dspy.Example` objects
    dspy_train_examples = [dspy.Example(row).with_inputs("question") for row in train_rows]
    dspy_val_examples = [dspy.Example(row).with_inputs("question") for row in val_rows]

    # publish the datasets to Weave; this lets us version the data and use it for evaluation
    weave.publish(weave.Dataset(name=f"bigbenchhard_{metadata.big_bench_hard_task}_train", rows=train_rows))
    weave.publish(weave.Dataset(name=f"bigbenchhard_{metadata.big_bench_hard_task}_val", rows=val_rows))

    return dspy_train_examples, dspy_val_examples
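With the metadata object defined earlier, building and publishing the splits is then a single call:

dspy_train_examples, dspy_val_examples = get_dataset(metadata)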
The datasets, once published, can be explored in the Weave UI.
🤖 Implementing the LLM workflow as a DSPy program
DSPy is a framework that moves the construction of new LM pipelines away from manipulating free-form strings and toward programming: you compose modular operators into text transformation graphs, and a compiler automatically generates optimized LM invocation strategies and prompts from your program.
According to the DSPy programming model, string-based prompting techniques are first translated into declarative modules with natural-language typed signatures. Then, each module is parameterized to learn its desired behavior by iteratively bootstrapping useful demonstrations within the pipeline.
Check the following papers to learn more about the DSPy paradigm:
- DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines (Khattab et al., 2023)
- Demonstrate-Search-Predict: Composing Retrieval and Language Models for Knowledge-Intensive NLP (Khattab et al., 2022)
system_prompt = """You are an expert in the field of causal reasoning.You are to analyze the a given question carefully and answer in `Yes` or `No`.You should also provide a detailed explanation justifying your answer."""llm = dspy.OpenAI(model="gpt-3.5-turbo", system_prompt=system_prompt)dspy.settings.configure(lm=llm)
✍️ Writing a baseline causal reasoning program
A baseline DSPy program for causal reasoning
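The program panel itself isn't reproduced above; the sketch below shows one way to write such a baseline, assuming a simple question-to-answer signature answered with chain-of-thought prompting (the class and field names are illustrative):

class QuestionAnswerSignature(dspy.Signature):
    """Answer the given causal reasoning question with `Yes` or `No`."""

    question = dspy.InputField()
    answer = dspy.OutputField(desc="Yes or No")


class CausalReasoningBaseline(dspy.Module):
    def __init__(self):
        super().__init__()
        # Chain-of-thought prompting over the signature defined above
        self.prog = dspy.ChainOfThought(QuestionAnswerSignature)

    @weave.op()
    def forward(self, question: str) -> dspy.Prediction:
        # Weave traces every call to the baseline program
        return self.prog(question=question)


baseline_module = CausalReasoningBaseline()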
⚖️ Evaluating the baseline DSPy program
Now that we have a baseline prompting strategy, let's evaluate it on our validation set using the Weave Evaluation API and a straightforward metric that matches the predicted answer against the ground truth. Weave takes each example, passes it through your application, and scores the output with each of your custom scoring functions. This gives you a clear view of your application's performance and a rich UI to drill into individual outputs and scores.
Evaluating the Baseline DSPy Program
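The evaluation panel isn't reproduced above, so here is a minimal sketch, assuming the published validation split and a simple exact-match scorer. Note that, depending on your Weave version, the scorer's prediction argument is named model_output or output.

import asyncio


@weave.op()
def baseline_prediction(question: str) -> str:
    # Run the baseline DSPy program and return only the answer string
    return baseline_module(question=question).answer


@weave.op()
def exact_match_scorer(answer: str, model_output: str) -> dict:
    # `answer` is the ground-truth column from the dataset row
    return {"correct": answer.strip().lower() == model_output.strip().lower()}


# Fetch the validation split we published to Weave earlier
validation_dataset = weave.ref(f"bigbenchhard_{metadata.big_bench_hard_task}_val").get()

evaluation = weave.Evaluation(dataset=validation_dataset, scorers=[exact_match_scorer])

# In a notebook, use `await evaluation.evaluate(baseline_prediction)` instead
asyncio.run(evaluation.evaluate(baseline_prediction))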
Note that running the evaluation will cost approximately $0.24 in OpenAI credits. Not bad! Weave shows you the cost of all your traces and evaluations, which helps you keep track of the cost of running your LLM experiments and operating your LLM workflow in production.

🦾 Optimizing our DSPy program
Optimizing the DSPy Program
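The optimization panel isn't reproduced above; below is a minimal sketch using DSPy's BootstrapFewShot teleprompter, which compiles the baseline program by bootstrapping few-shot demonstrations from the training split and keeping only those that pass a simple exact-match metric (the argument values are assumptions):

from dspy.teleprompt import BootstrapFewShot


def exact_match_metric(example, prediction, trace=None) -> bool:
    # Keep only bootstrapped demonstrations whose answer matches the ground truth
    return example.answer.strip().lower() == prediction.answer.strip().lower()


@weave.op()
def get_optimized_program(program: dspy.Module) -> dspy.Module:
    # Compile the program: bootstrap few-shot demos from the training examples
    teleprompter = BootstrapFewShot(metric=exact_match_metric, max_bootstrapped_demos=8)
    return teleprompter.compile(program, trainset=dspy_train_examples)


optimized_module = get_optimized_program(baseline_module)

Re-running the same weave.Evaluation from the previous section on optimized_module lets you compare the optimized program against the baseline side by side in the Weave UI.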
🏁 Conclusion
- We've learned how to optimize our LLM programs for causal reasoning using DSPy teleprompters and how to evaluate them using Weave.
📕 Further Resources
We have a free prompt engineering course here to help you think about how to structure your prompts. Also, feel free to check out the following reports to learn more about developing LLM applications:
Building an AI teacher's assistant using LlamaIndex and Groq
Today, we're going to leverage a RAG pipeline to create an AI TA capable of helping out with grading, questions about a class syllabus, and more
Refactoring Wandbot—our LLM-powered document assistant—for improved efficiency and speed
This report tells the story of how we utilized auto-evaluation-driven development to enhance both the quality and speed of Wandbot.
GPT-4o Python quickstart using the OpenAI API
Getting set up and running GPT-4o on your machine in Python using the OpenAI API.
How to use the Gemini Pro API with W&B Weave
Powerful LLMs need observability. Here's how to get it.