How to optimize LLM workflows using DSPy and W&B Weave
Learn how to use DSPy teleprompters and Weave to automatically optimize prompting strategies for causal reasoning
🎬 Introduction
In this report, we'll show you how to improve the performance of an LLM workflow built for the causal judgment task from the BIG-Bench Hard benchmark, and how to evaluate different prompting strategies. We'll use DSPy to implement the LLM workflow and optimize our prompting strategy, and Weave, W&B's lightweight toolkit for developing LLM applications, to track the workflow and evaluate our prompting strategies.
You might not be familiar with BIG-bench, so a quick word on that: it stands for Beyond the Imitation Game Benchmark and is a collaborative benchmark of more than 200 tasks intended to probe large language models and extrapolate their future capabilities. BIG-Bench Hard (BBH) is a suite of 23 of the most challenging BIG-Bench tasks, ones that the current generation of language models still struggles to solve.
You can follow along with this tutorial in the accompanying Google Colab if you'd like to try any of this yourself:

Table of contents
🎬 Introduction
Table of contents
⚙️ Tools we need for this project
🧶 Enable tracking using Weave
💽 Load the BIG-Bench Hard dataset
🤖 Implementing the LLM workflow as a DSPy program
✍️ Writing a baseline causal reasoning program
⚖️ Evaluating the baseline DSPy program
🦾 Optimizing our DSPy program
🏁 Conclusion
📕 Further Resources
⚙️ Tools we need for this project
Installing the libraries and getting the OpenAI API Key
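The original setup panel isn't reproduced above, so here's a minimal sketch of what it needs to do, assuming the dspy-ai, weave, and datasets packages are the required dependencies:

# In Colab, install the dependencies first (package names are assumptions):
# !pip install -qU dspy-ai weave datasets

import os
from getpass import getpass

# Prompt for the OpenAI API key without echoing it, and expose it to the OpenAI client
os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")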
🧶 Enable tracking using Weave
Enable tracking and metadata management using Weave
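The panel contents aren't reproduced above, so here's a minimal sketch: weave.init starts tracing every weave.op in a named project, and a small Metadata object carries the experiment configuration. The field names and default values below are assumptions inferred from how metadata is used in get_dataset further down.

import weave
from pydantic import BaseModel

# Start Weave tracking; the project name is illustrative
weave.init(project_name="dspy-bigbench-hard")


class Metadata(BaseModel):
    # Hypothetical experiment configuration; fields inferred from get_dataset below
    dataset_address: str = "maveriq/bigbenchhard"  # assumed Hugging Face dataset path
    big_bench_hard_task: str = "causal_judgement"  # assumed BBH task/config name
    num_train_examples: int = 50                   # assumed train/validation split point


metadata = Metadata()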
💽 Load the BIG-Bench Hard dataset
We'll load this dataset from HuggingFace Hub, split it into training and validation sets, and publish them on Weave. This lets us version the datasets and use the Weave Evaluation API to evaluate our prompting strategy.
import dspy
from datasets import load_dataset


@weave.op()
def get_dataset(metadata: Metadata):
    # load the BIG-Bench Hard dataset corresponding to the task from the Hugging Face Hub
    dataset = load_dataset(metadata.dataset_address, metadata.big_bench_hard_task)["train"]

    # create the training and validation datasets
    rows = [{"question": data["input"], "answer": data["target"]} for data in dataset]
    train_rows = rows[0:metadata.num_train_examples]
    val_rows = rows[metadata.num_train_examples:]

    # create the training and validation examples consisting of `dspy.Example` objects
    dspy_train_examples = [dspy.Example(row).with_inputs("question") for row in train_rows]
    dspy_val_examples = [dspy.Example(row).with_inputs("question") for row in val_rows]

    # publish the datasets to Weave; this lets us version the data and use it for evaluation
    weave.publish(weave.Dataset(name=f"bigbenchhard_{metadata.big_bench_hard_task}_train", rows=train_rows))
    weave.publish(weave.Dataset(name=f"bigbenchhard_{metadata.big_bench_hard_task}_val", rows=val_rows))

    return dspy_train_examples, dspy_val_examples
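With the metadata object defined earlier, building and publishing the splits is then a single call:

dspy_train_examples, dspy_val_examples = get_dataset(metadata)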
The datasets, once published, can be explored in the Weave UI.
🤖 Implementing the LLM workflow as a DSPy program
DSPy is a framework that moves the construction of new LM pipelines away from manipulating free-form strings and toward programming: you compose modular operators into text transformation graphs, and a compiler automatically generates optimized LM invocation strategies and prompts from your program.
According to the DSPy programming model, string-based prompting techniques are first translated into declarative modules with natural-language typed signatures. Then, each module is parameterized to learn its desired behavior by iteratively bootstrapping useful demonstrations within the pipeline.
Check the following papers to learn more about the DSPy paradigm:
- DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines (Khattab et al., 2023)
- Demonstrate-Search-Predict: Composing Retrieval and Language Models for Knowledge-Intensive NLP (Khattab et al., 2022)
system_prompt = """You are an expert in the field of causal reasoning.You are to analyze the a given question carefully and answer in `Yes` or `No`.You should also provide a detailed explanation justifying your answer."""llm = dspy.OpenAI(model="gpt-3.5-turbo", system_prompt=system_prompt)dspy.settings.configure(lm=llm)
✍️ Writing a baseline causal reasoning program
A baseline DSPy program for causal reasoning
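The program panel itself isn't reproduced above; the sketch below shows one way to write such a baseline, assuming a simple question-to-answer signature answered with chain-of-thought prompting (the class and field names are illustrative):

class QuestionAnswerSignature(dspy.Signature):
    """Answer the given causal reasoning question with `Yes` or `No`."""

    question = dspy.InputField()
    answer = dspy.OutputField(desc="Yes or No")


class CausalReasoningBaseline(dspy.Module):
    def __init__(self):
        super().__init__()
        # Chain-of-thought prompting over the signature defined above
        self.prog = dspy.ChainOfThought(QuestionAnswerSignature)

    @weave.op()
    def forward(self, question: str) -> dspy.Prediction:
        # Weave traces every call to the baseline program
        return self.prog(question=question)


baseline_module = CausalReasoningBaseline()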
⚖️ Evaluating the baseline DSPy program
Now that we have a baseline prompting strategy, let's evaluate it on our validation set using the Weave Evaluation API and a straightforward metric that matches the predicted answer against the ground truth. Weave takes each example, passes it through your application, and scores the output with each of your custom scoring functions. This gives you a clear view of your application's performance and a rich UI to drill into individual outputs and scores.
Evaluating the Baseline DSPy Program
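The evaluation panel isn't reproduced above, so here is a minimal sketch, assuming the published validation split and a simple exact-match scorer. Note that, depending on your Weave version, the scorer's prediction argument is named model_output or output.

import asyncio


@weave.op()
def baseline_prediction(question: str) -> str:
    # Run the baseline DSPy program and return only the answer string
    return baseline_module(question=question).answer


@weave.op()
def exact_match_scorer(answer: str, model_output: str) -> dict:
    # `answer` is the ground-truth column from the dataset row
    return {"correct": answer.strip().lower() == model_output.strip().lower()}


# Fetch the validation split we published to Weave earlier
validation_dataset = weave.ref(f"bigbenchhard_{metadata.big_bench_hard_task}_val").get()

evaluation = weave.Evaluation(dataset=validation_dataset, scorers=[exact_match_scorer])

# In a notebook, use `await evaluation.evaluate(baseline_prediction)` instead
asyncio.run(evaluation.evaluate(baseline_prediction))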
Note that running the evaluation will cost approximately $0.24 in OpenAI credits. Not bad! Weave shows you the cost of all your traces and evaluations, which helps you keep track of the cost of running your LLM experiments and operating your LLM workflow in production.

🦾 Optimizing our DSPy program
Optimizing the DSPy Program
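The optimization panel isn't reproduced above; below is a minimal sketch using DSPy's BootstrapFewShot teleprompter, which compiles the baseline program by bootstrapping few-shot demonstrations from the training split and keeping only those that pass a simple exact-match metric (the argument values are assumptions):

from dspy.teleprompt import BootstrapFewShot


def exact_match_metric(example, prediction, trace=None) -> bool:
    # Keep only bootstrapped demonstrations whose answer matches the ground truth
    return example.answer.strip().lower() == prediction.answer.strip().lower()


@weave.op()
def get_optimized_program(program: dspy.Module) -> dspy.Module:
    # Compile the program: bootstrap few-shot demos from the training examples
    teleprompter = BootstrapFewShot(metric=exact_match_metric, max_bootstrapped_demos=8)
    return teleprompter.compile(program, trainset=dspy_train_examples)


optimized_module = get_optimized_program(baseline_module)

Re-running the same weave.Evaluation from the previous section on optimized_module lets you compare the optimized program against the baseline side by side in the Weave UI.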
🏁 Conclusion
- We've learned how to optimize our LLM programs for causal reasoning using DSPy teleprompters and how to evaluate them using Weave.
📕 Further Resources
We have a free prompt engineering course here to help you think about how to structure your prompts. Also, feel free to check out the following reports to learn more about developing LLM applications:
Building an AI teacher's assistant using LlamaIndex and Groq
Today, we're going to leverage a RAG pipeline to create an AI TA capable of helping out with grading, questions about a class syllabus, and more
Refactoring Wandbot—our LLM-powered document assistant—for improved efficiency and speed
This report tells the story of how we utilized auto-evaluation-driven development to enhance both the quality and speed of Wandbot.
GPT-4o Python quickstart using the OpenAI API
Getting set up and running GPT-4o on your machine in Python using the OpenAI API.
How to use the Gemini Pro API with W&B Weave
Powerful LLMs need observability. Here's how to get it.