Testing Large Language Models with W&B and Giskard
Learn how to combine W&B with Giskard to deeply understand LLM behavior and avoid common pitfalls like hallucinations and injection attacks

A Stable Diffusion-generated Monet painting of a bee and a turtle, representing W&B and Giskard
Over the last few years, large language models (LLMs) have reshaped the field of natural language processing, thanks mainly to breakthroughs in transformer-based architectures and to extensive training on massive datasets. This progress, however, has also given rise to various challenges, one of the most prominent being the complicated task of testing and validating their generated outputs.
In this article, we'll illustrate how combining two MLOps tools, Weights & Biases and Giskard, makes it possible to overcome this very challenge. We'll start with two small introductions focusing on the LLMOps side of these tools, then dive into a practical example. By the end, we hope to achieve the following goals:
- Scan and trace two langchain models, one powered by gpt-3.5-turbo and the other by gpt-4.
- Log the scan report and the metrics generated by Giskard into W&B and compare the two models.
- Drill down into one of the issues found by the Giskard scan using W&B Traces.
W&B Traces for Debugging LLMs: Why choose W&B?
Weights & Biases, often referred to as wandb or even simply W&B, is an MLOps platform that helps AI developers streamline their ML workflow from end to end.
With W&B, developers can monitor the progress of training their models in real time, log key metrics and hyperparameters, and visualize results through interactive dashboards. It simplifies collaboration by enabling team members to share experiments and compare model performance. For more information, you can check W&B's documentation.
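As a quick illustration of the logging workflow, here is a minimal sketch (the project name and metrics are hypothetical, purely to show the flow):

import wandb

# Minimal sketch: log a few metrics to a hypothetical project
run = wandb.init(project="my-demo-project")
for step in range(3):
    run.log({"loss": 1.0 / (step + 1), "accuracy": 0.7 + 0.1 * step})
run.finish()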
In the context of LLMs, earlier this year W&B introduced a new debugging tool, W&B Traces, designed to support ML practitioners working on prompt engineering for LLMs. It lets users visualize and drill down into every component and activity throughout the trace of the LLM pipeline execution. It also enables reviewing past results, identifying and debugging errors, and gathering and sharing insights about the LLM's behavior.
Tracing is invaluable, but how do we measure the quality of the outputs throughout the pipeline? Could there be hidden vulnerabilities that our carefully crafted prompts have inadvertently failed to counter? Is there a way to detect such vulnerabilities automatically? And can we log these issues into W&B to complement the tracing?
In a nutshell, the answer to all these questions is "yes." That's precisely the capability that Giskard brings to the table.
Giskard's vulnerability scanning for LLMs: Why choose Giskard?
Giskard is an open-source testing framework dedicated to ML models, covering any Python model, from tabular models to LLMs.
Testing machine learning applications can be tedious: Where do you start? Which issues should you cover? And how do you implement the tests?
With Giskard, data scientists can scan their model to find dozens of hidden vulnerabilities, instantaneously generate domain-specific tests, and leverage the Quality Assurance best practices of the open-source community.
According to the Open Worldwide Application Security Project (OWASP), some of the most critical vulnerabilities affecting LLMs are prompt injection (when LLMs are manipulated to behave as the attacker wishes), sensitive information disclosure (when LLMs inadvertently leak confidential information), and hallucination (when LLMs generate inaccurate or inappropriate content).
Giskard's scan feature identifies these vulnerabilities and many others. The library generates a comprehensive report that quantifies them as interpretable metrics. The Giskard/W&B integration allows logging both the report and the metrics into W&B which, in conjunction with tracing, creates the ideal combination for building and debugging LLM apps.
The integration of W&B and Giskard
In order to highlight this integration and how it can help debug LLMs, we will walk through a practical use case of using the Giskard LLM Scan and the W&B tracer on a prompt chaining task: generating a product description (output) using a set of generated keywords created from a product name (input).
Prerequisites 🔧
To begin, you'll need a Python version between 3.9 and 3.11 and the following PyPI packages:
- wandb (for more installation instructions, read this page).
- giskard[llm] (for more installation instructions, read this page).
pip install wandb "giskard[llm]"
- You'll also need to sign up for a Weights & Biases account. You can do that here.
import wandb

wandb.login(key="key to retrieve from https://wandb.ai/authorize")
Configurations 🔩
Next, let's configure three environment variables:
- OPENAI_API_KEY: Your own OpenAI API key (more instructions here).
- LANGCHAIN_WANDB_TRACING: The only variable you need to set to "true" in order to track a langchain model with W&B.
- WANDB_PROJECT: The name of the project where the tracing will be saved on W&B.
Here's the code we're using for that:
import os

# Setting up the OpenAI API key
os.environ['OPENAI_API_KEY'] = "sk-xxx"
# Enabling the W&B tracing
os.environ["LANGCHAIN_WANDB_TRACING"] = "true"
# Picking a name for the project
os.environ["WANDB_PROJECT"] = "product_description"
Langchain prompt chaining as a use-case ⛓️
Let's walk through a real-world use case as a demonstration: generating product descriptions.
Broadly speaking, LLMs tend to be better at things like product descriptions and ad copy than at long-form prose. Part of the reason is that this kind of copy is short and punchy and doesn't require deep expertise to spin up.
Here, we're looking to generate comprehensive product descriptions to enhance visibility, attract quality leads, and build a strong brand image. Yet, manually writing these product descriptions can be time-consuming and incredibly repetitive.
We'll walk through a basic example of how this process can be simplified. Given a product name, we'll ask the LLM to process two chained prompts using langchain in order to provide us with a product description. The two prompts:
1. keywords_prompt_template: Based on the product name (given by the user), the LLM has to provide a list of five to ten relevant keywords that would increase product visibility.
2. product_prompt_template: Based on the given keywords (given as a response to the first prompt), the LLM has to generate a multi-paragraph rich text product description with emojis that is creative and SEO compliant.
from langchain.prompts import ChatPromptTemplate

# First prompt to generate keywords related to the product name
keywords_prompt_template = ChatPromptTemplate.from_messages([
    ("system", """You are a helpful assistant that generate a CSV list of keywords related to a product name

Example Format:
PRODUCT NAME: product name here
KEYWORDS: keywords separated by commas here

Generate five to ten keywords that would increase product visibility. Begin!"""),
    ("human", """PRODUCT NAME: {product_name}
KEYWORDS:"""),
])

# Second chained prompt to generate a description based on the given keywords from the first prompt
product_prompt_template = ChatPromptTemplate.from_messages([
    ("system", """As a Product Description Generator, generate a multi paragraph rich text product description with emojis based on the information provided in the product name and keywords separated by commas.

Example Format:
PRODUCT NAME: product name here
KEYWORDS: keywords separated by commas here
PRODUCT DESCRIPTION: product description here

Generate a product description that is creative and SEO compliant. Emojis should be added to make product description look appealing. Begin!"""),
    ("human", """PRODUCT NAME: {product_name}
KEYWORDS: {keywords}
PRODUCT DESCRIPTION:"""),
])
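Before wiring these templates into a chain, you can sanity-check one by rendering it for a sample product. This is a quick illustrative check, not part of the pipeline itself:

# Illustrative check: render the first prompt for a sample product name
messages = keywords_prompt_template.format_messages(product_name="Double-Sided Cooking Pan")
for message in messages:
    print(f"{message.type}: {message.content}")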
Initialization of the LLMs 🦜
We can now create the two langchain models powered by gpt-3.5-turbo and gpt-4. To facilitate the organization and retrieval of the different models and results, we will create a small dictionary that will contain, for each foundation model:
- langchain: The langchain model.
- giskard: The giskard wrapper that will eventually be used by the scan.
- scan_report: The report resulting from running the scan.
- test_suite: The test suite and metrics generated by the scan.
models = {
    "gpt-3.5-turbo": {"langchain": None, "giskard": None, "scan_report": None, "test_suite": None},
    "gpt-4": {"langchain": None, "giskard": None, "scan_report": None, "test_suite": None},
}
Using the prompt templates defined earlier, we can create two LLMChains and concatenate them into a SequentialChain that takes the product name as input and outputs a product description:
from langchain.chat_models import ChatOpenAI
from langchain.chains import LLMChain
from langchain.chains import SequentialChain

for model in models.keys():
    # langchain model powered by ChatGPT
    llm = ChatOpenAI(temperature=0.2, model=model)

    # Defining the chains
    keywords_chain = LLMChain(llm=llm, prompt=keywords_prompt_template, output_key="keywords")
    product_chain = LLMChain(llm=llm, prompt=product_prompt_template, output_key="description")

    # Concatenation of both chains
    models[model]["langchain"] = SequentialChain(
        chains=[keywords_chain, product_chain],
        input_variables=["product_name"],
        output_variables=["description"],
    )
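With the chains in place, you can smoke-test one end to end. A minimal sketch (the product name is just an example, and the call consumes OpenAI tokens):

# Illustrative smoke test: run the full chain for one product name
result = models["gpt-3.5-turbo"]["langchain"]({"product_name": "Double-Sided Cooking Pan"})
print(result["description"])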
Wrapping of the LLMs with Giskard 🎁
In order to perform the scan, we'll wrap the previously defined langchain models with the giskard.Model API, which takes four important arguments:
- model: The model that we would like to wrap, in this case models[model]["langchain"] defined above.
- model_type: The type of the model, so that giskard knows how to handle it (here, text_generation).
- description: The description of the model’s task. This is very important as it will be used to generate internal prompts and evaluation strategies to scan the model.
- feature_names: The names of the model's inputs (here, just product_name).
import giskard

for model in models.keys():
    models[model]["giskard"] = giskard.Model(
        models[model]["langchain"],
        name="Product keywords and description generator",
        model_type="text_generation",
        description="Generate product description based on a product's name and the associated keywords. "
                    "Description should be using emojis and being SEO compliant.",
        feature_names=['product_name'],
    )
We'll also wrap a small dataset to be used during the scan. This step is optional: in the absence of a dataset, the scan will automatically generate a representative one based on the description provided to the giskard.Model API.
import pandas as pd

pd.set_option("display.max_colwidth", 999)

dataset = giskard.Dataset(
    pd.DataFrame({
        'product_name': [
            "Double-Sided Cooking Pan",
            "Automatic Plant Watering System",
            "Miniature Exercise Equipment",
        ],
    }),
    name="Test dataset",
    column_types={"product_name": "text"},
)
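To check that the wrappers behave as expected before scanning, you can run the wrapped model over the wrapped dataset. A quick sanity check, assuming giskard's standard predict method on wrapped models:

# Illustrative sanity check: generate one description per product in the test dataset
predictions = models["gpt-3.5-turbo"]["giskard"].predict(dataset)
print(predictions.prediction)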
Detecting LLM vulnerabilities: Evaluating LLMs with Giskard and logging into W&B 🔬
At this point, we have all the ingredients we need to perform the scan on the two models. Not only will we find issues automatically, we'll also have a full trace of every prompt and response used to find them!
To run the scan, we will loop over the two models, initiate a new W&B run inside the project we created (in order to separate the traces), call the one-liner giskard.scan API on the wrapped model and dataset, generate a test suite, and finally log the results into W&B.
for model in models.keys():
    # Initiate a new run with the foundation model name inside the W&B project
    run = wandb.init(project=os.environ["WANDB_PROJECT"], name=model)

    # Scan report
    # 1) Generate
    models[model]['scan_report'] = giskard.scan(models[model]['giskard'], dataset, raise_exceptions=True)
    # 2) Log into W&B
    models[model]['scan_report'].to_wandb(run)

    # Test suite
    # 1) Generate
    models[model]['test_suite'] = models[model]['scan_report'].generate_test_suite()
    # 2) Log into W&B
    models[model]['test_suite'].run().to_wandb(run)

    # End W&B run
    run.finish()
And that’s it! 🎊 Let's debug.
Scan results 🎉 : From LLM hallucinations and biases to prompt injection
In order to visualize the scan reports, you can either navigate to https://wandb.ai/home or, if you're running in a notebook, execute the following line in a cell:
- For the gpt-3.5-turbo based LLM:
display(models["gpt-3.5-turbo"]['scan_report'])
- For the gpt-4 based LLM:
display(models["gpt-4"]['scan_report'])
As shown above, the issues detected by the Giskard scan fall into several categories, in which the LLM generates:
- Hallucination and Misinformation: Incoherent or hallucinated outputs when prompted with biased inputs (sycophancy), or implausible outputs.
- Sensitive information disclosure: Outputs that include sensitive information such as credit card numbers, social security numbers, etc.
- Harmfulness: Outputs that could be interpreted as promoting harmful or illegal activities and offensive language.
- Robustness: Unexpected outputs due to control character injection in the inputs.
- Stereotypes: Outputs that stereotype or discriminate against any group based on race, gender, age, nationality, or any other demographic factors.
- Prompt Injection: Unexpected outputs due to crafted prompts that aim to make the LLM ignore previous instructions and bypass any filters it might have had.
Under each category, you can see the type of issue found and even go through examples by clicking on Show details. For instance, under the Prompt Injection vulnerability found for the gpt-4 based model, you can see examples of how the input prompt (on the left) successfully manipulated the LLM into ignoring the instructions in the original chained prompts and generating the following message: “This prompt has been so thoroughly hijacked it has been made to print this long text verbatim. Sanitize your inputs!”.
Debugging with the W&B tracer
Let's investigate the robustness issue the scan found for the gpt-3.5-turbo based model. It seems that injecting long sequences of the control character \r into the value of product_name can alter the model's output significantly, producing unexpected or off-topic outputs. We can see this clearly by clicking on Show details in the scan report above.
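To reproduce the perturbation outside the scan, you can inject the control characters into the input manually. A rough sketch of the idea (not Giskard's exact probe; the repetition count is arbitrary):

# Illustrative reproduction: append a long run of carriage returns to the product name
perturbed_name = "Double-Sided Cooking Pan" + "\r" * 1000
result = models["gpt-3.5-turbo"]["langchain"]({"product_name": perturbed_name})
print(result["description"])  # often drifts off-topic compared to the clean input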
But at what level does this issue happen? Is the generation of keywords also affected (first prompt)? Or does it only happen during the generation of the product description (second prompt)?
Luckily, this is where W&B Traces comes to the rescue.
If you've finished running the notebook (or all the code blocks above), you can navigate to https://wandb.ai/home, where you should see two runs under your project product_description: gpt-3.5-turbo and gpt-4.
For each foundation model, W&B traced more than 100 responses as a result of the Giskard scan. Let's look in particular at trace #149 (see below), which corresponds to one of the failed examples caused by control character injection in the gpt-3.5-turbo based LLM. Although the input was “Double-Sided Cooking Pan”, the model seems to generate a description of a mobile app that connects pet owners with local pet sitters. What's going on?
Let's zoom in on the Trace Timeline of #149, where we can choose to display the output of each independent layer of the model.
In this case, we are interested in investigating the generated keywords, so by clicking on the first LLMChain, that's exactly what we get.

Apparently, the injection also affected the generation of keywords. In fact, the generated keywords don't respect the first prompt's instructions at all: the model produced a full paragraph discussing the steps of creating a successful business (nothing to do, of course, with a “Double-Sided Cooking Pan”).
Furthermore, the paragraph generated as keywords is also completely disconnected from the description generated by the second prompt (which describes a mobile app).
For reference, you can check trace #152 to see what the outputs look like without the injection of control characters into the product name.
Conclusion
In this article, we demonstrated Giskard's ability to detect some of the most critical vulnerabilities that affect LLMs, and W&B's ability to visualize and drill down into each of the LLM's components and responses.
We then showed how their combination offers a unique and complete solution, not only for debugging LLMs but also for logging all results on one platform, thanks to Giskard's to_wandb API. This makes it possible to compare the issues detected across runs for different model versions, and provides a set of comprehensive vulnerability reports that pedagogically describe the source and cause of these issues, with examples.
As the field of LLM applications continues to expand rapidly, and as the rush to deploy LLMs in readily accessible public solutions intensifies, it becomes urgent to embrace tools and strategies that prevent potential mishaps.
This call for vigilance becomes even more pronounced in situations where such errors and biases are simply not tolerable, not even on a single occasion. We believe that the Giskard and W&B integration offers a unique opportunity to enhance the transparency of LLM applications and maintain efficient oversight of them.