
W&B Best Practices Guide for LLMs



W&B Experiments and Logging

SDK Installation and Login

To start using W&B, you first need to install the Python package (if it's not already installed):
pip install wandb
Once it's installed, authenticate your user account by logging in through the CLI or SDK. You should have received an email inviting you to sign up for the platform, after which you can obtain your API token:
wandb login --host <YOUR W&B HOST URL> <YOUR API TOKEN>
OR through Python:
wandb.login(host=os.getenv("WANDB_BASE_URL"), key=os.getenv("WANDB_API_KEY"))
Once you are logged in, you are ready to track your workflows!

W&B Runs

At the core of W&B is a Run, which is a logged unit of execution of Python code. A Run captures the entire execution context of that unit: Python library versions, hardware info, system metrics, git state, etc. It is best to create a run for each granular piece of your pipeline, so it's easy to identify the inputs and outputs of a given piece of code. For example, you might have a Run for each of:
  • LLM Inference on a dataset
  • Creation of embeddings from some documents
  • GPT4 Evaluation of LLM outputs
To create a run, call wandb.init(). There are a bunch of important arguments you can pass to wandb.init() to provide additional context for the run and enable you to organize your runs later:
import wandb

wandb.init(project="prompt-playground",
           entity="wandb-smle",          # Team
           job_type='evaluation',        # For organizing runs (e.g. evaluation)
           config={'temperature': 0.2})  # Hyper-parameters and other config

# My code and logic
# wandb logging (discussed below)

wandb.finish() # finish the run and commit results to W&B

See the full documentation for wandb.init for other arguments to customize its behavior.

What Can I Log and How Do I Log It?

Within a run context, you can explicitly log all sorts of useful information, such as metrics, visualizations, charts, and interactive data tables, to record the inputs, outputs, and results of an LLM execution. Two primary functions to be aware of are:
  • wandb.log for logging time series, rich media, and tables
  • wandb.summary for logging aggregate scalar metrics for a run
Here is a comprehensive guide to wandb.log, along with its API docs.

Scalar Metrics

Scalar metrics can be logged by passing them to wandb.log as a dictionary mapping a metric name to its value.
for i in range(10):
    wandb.log({"my_metric": some_scalar_value})

wandb.summary["my_metric"] = 0.9
wandb.summary can be used to log scalar values as part of the run summary. Summary metrics appear in the Runs Table with a single value per run, whereas wandb.log lets you log a series of metrics, plots, or tables over the course of a run. In the LLM world, this is the difference between logging the accuracy of each of 10 sample documents during an evaluation run (wandb.log) and logging a single average accuracy for the whole evaluation run (wandb.summary).
Each time wandb.log is called, W&B increments an internal counter called step. This is the x-axis you see on all time-series charts. If you call wandb.log once per epoch, the step represents the epoch count, but you may also call it in validation or testing loops, in which case the meaning of the step is less clear. To set the step manually, pass step=my_int_variable to wandb.log; this can be important for getting your charts at the resolution you want. Logging a series of scalar values produces a line plot in the workspace, which is mostly useful for model fine-tuning and training.
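For example, here is a minimal sketch (with hypothetical per-document accuracies) that passes step explicitly and also sets a summary metric:
import wandb

run = wandb.init(project="prompt-playground", job_type="evaluation")

# Hypothetical per-document accuracies; the document index is used as the explicit step,
# so the x-axis of the resulting line plot is the document number.
per_doc_accuracy = [0.8, 0.6, 0.9]
for doc_idx, acc in enumerate(per_doc_accuracy):
    wandb.log({"doc_accuracy": acc}, step=doc_idx)

# Single aggregate value for the whole run, shown in the Runs Table
wandb.summary["avg_accuracy"] = sum(per_doc_accuracy) / len(per_doc_accuracy)

wandb.finish()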

Tables

W&B Tables are interactive data frames logged to your workspace. You define the schema and what goes in them. Tables logged with the same name across runs are concatenated together into one table, allowing you to see and compare rows from different evaluation or inference runs in one place!
wandb.init(project="my_project", entity="my_team", job_type="llm_evaluation")
eval_table = wandb.Table(dataframe=my_pandas_df)
wandb.log({"my_eval_table": eval_table})
wandb.finish()

Table Filtering

There are two levels of filtering: 1) at the runs level and 2) at the table level. For instance, we could first filter to all gpt4_evaluation runs which logged a rougeL score of at least 0.2, and then filter to just those rows in the table that had a gpt4 result of NO. You can filter a table by applying filter conditions to different column values, similar to pandas.
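The same conditions can also be applied programmatically before logging; here is a minimal pandas sketch with hypothetical rougeL and gpt4_result columns:
import pandas as pd
import wandb

# Hypothetical evaluation results
eval_df = pd.DataFrame({
    "rougeL": [0.1, 0.25, 0.4],
    "gpt4_result": ["YES", "NO", "NO"],
})

# Keep rows with rougeL >= 0.2 and a gpt4_result of NO, then log the filtered table
filtered = eval_df[(eval_df["rougeL"] >= 0.2) & (eval_df["gpt4_result"] == "NO")]

run = wandb.init(project="prompt-playground", job_type="evaluation")
wandb.log({"filtered_eval_table": wandb.Table(dataframe=filtered)})
wandb.finish()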

Table Grouping

You can group a table by the values of a column, similar to a pandas groupby. Scalars are aggregated into histograms for each group.

Creating New Columns

Tables are linked to the Run which generated them, meaning you can create new columns that bring in that run context, for instance the run config. We can also derive new columns from other columns, like computing the Levenshtein distance between two of the columns.
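You can also compute such a derived column yourself before logging. Here is a minimal sketch using difflib's similarity ratio as a stand-in for a true Levenshtein distance, with hypothetical prediction and reference columns:
import difflib

import pandas as pd
import wandb

# Hypothetical model outputs and references
eval_df = pd.DataFrame({
    "prediction": ["The cat sat on the mat.", "Dogs bark loudly."],
    "reference": ["The cat sat on a mat.", "The dog barked loudly."],
})

# Derived column: edit similarity between prediction and reference
eval_df["similarity"] = [
    difflib.SequenceMatcher(None, p, r).ratio()
    for p, r in zip(eval_df["prediction"], eval_df["reference"])
]

run = wandb.init(project="prompt-playground", job_type="evaluation")
wandb.log({"eval_table_with_similarity": wandb.Table(dataframe=eval_df)})
wandb.finish()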

Logging Tables and Metrics as Unit Evaluation

In traditional software development, unit testing exercises granular functionality as part of a larger system. In LLM evaluation, we are testing a non-deterministic system in which multiple functions are composed together (e.g. a RAG pipeline). Each piece (usually a Python function) should be evaluated independently first, and the results can be logged as W&B tables and metrics. Each Run corresponds to one "unit test" or "unit evaluation" of a given function, with its inputs and outputs over an evaluation dataset logged as a table.
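A minimal sketch of one such unit evaluation run, where summarize_document and eval_docs are hypothetical stand-ins for the function under test and its evaluation dataset:
import wandb

eval_docs = ["First sample document...", "Second sample document..."]  # hypothetical dataset

def summarize_document(doc: str) -> str:
    # Stand-in for the real LLM call being unit-evaluated
    return doc[:30]

run = wandb.init(project="prompt-playground", job_type="unit_eval_summarize")

results = wandb.Table(columns=["document", "summary"])
for doc in eval_docs:
    results.add_data(doc, summarize_document(doc))

wandb.log({"summarize_unit_eval": results})
wandb.summary["num_examples"] = len(eval_docs)
wandb.finish()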

Trace Tables

Trace Tables are a special type of table that includes metadata about nested function calls whose inputs and outputs are the entries in the table. Where W&B Tables can be used to "unit evaluate" an application, Trace Tables are helpful for "integration evaluation" (à la "integration testing"), when we have a complex system (e.g. a RAG pipeline) composing multiple LLM calls, tools, and databases and we need to see how everything comes together. These systems are commonly built using existing frameworks like LangChain and LlamaIndex.
W&B has integrations with both of these frameworks, which automatically track entire executions of chains or retriever pipelines as rows in the Trace Table. Each row of the Trace Table is a full execution of an LLM chain, agent, or RAG pipeline, with the Input and Output columns holding the starting input and final output of the whole pipeline. Clicking on a row reveals the Trace Timeline, which records all the steps that happened in between with their respective inputs and outputs.
import os

from langchain.callbacks import get_openai_callback
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.prompts import PromptTemplate
from langchain.vectorstores import Chroma

# Enable the W&B LangChain integration so each chain execution is traced
os.environ["LANGCHAIN_WANDB_TRACING"] = "true"

# table_df, document_title, and question are assumed to come from earlier in the pipeline.
# Load the Chroma index previously persisted for this document.
index = table_df[table_df["title"] == document_title].index[0]
db_dir = os.path.join("chromadb", str(index))
embeddings = OpenAIEmbeddings()
db = Chroma(persist_directory=db_dir, embedding_function=embeddings)

prompt_template = """Use the following pieces of context to answer the question.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Don't add your opinions or interpretations. Ensure that you complete the answer.
If the question is not relevant to the context, just say that it is not relevant.

CONTEXT:
{context}

QUESTION: {question}

ANSWER:"""
prompt = PromptTemplate(template=prompt_template, input_variables=["context", "question"])
retriever = db.as_retriever()
retriever.search_kwargs["k"] = 2
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(temperature=0),
    chain_type="stuff",
    retriever=retriever,
    chain_type_kwargs={"prompt": prompt},
    return_source_documents=True,
)

# The callback tracks OpenAI token usage for this call
with get_openai_callback() as cb:
    result = qa({"query": question})

answer = result["result"]

Plots

There are a variety of ways to log charts to W&B, but they all boil down to two modes:
Static: Logging a pre-built chart (e.g. matplotlib, plotly)
Logging a pre-built chart works the same as logging any other rich media type. You can take a chart generated with matplotlib or plotly, or serialize it to an image or HTML, and log it using wandb.log:
from matplotlib.pyplot import figure
import mpld3

fig = figure()
ax = fig.gca()
ax.plot([1,2,3,4])

mpld3.save_html(fig,"test.html")

wandb.log({"matplotlib_to_html": wandb.Html(open("test.html"), inject=False)})
Dynamic: Logging a chart's raw data and dynamically rendering the chart in W&B
This requires logging the raw data backing a desired chart as a wandb.Table (see the Tables section above) and then using Vega to render the data graphically in the W&B UI. Fortunately, W&B has API abstractions under wandb.plot.<plot_type> which perform these two steps automatically for common chart types, so all you have to do is use the following pattern:
# Confusion matrices
wandb.log({"conf_mat": wandb.plot.confusion_matrix(y_true=ground_truth, preds=predictions, class_names=class_names)})

# ROC Curves
wandb.log({"roc": wandb.plot.roc_curve(ground_truth, predictions)})

# PR Curves
wandb.log({"pr": wandb.plot.pr_curve(ground_truth, predictions)})


The benefit of dynamic charts is that they overlay chart data from multiple runs, making it easier to compare runs against each other rather than across separate plots. For a full list of supported plots, check out this page. To create plots outside this list, you will need to log the raw data and use the Custom Chart Editor to edit or create a Vega spec that renders the data how you like.
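As a sketch of that workflow, you can log the raw data as a wandb.Table and bind it to a saved Vega spec with wandb.plot_table; the spec name below is a placeholder for your own saved custom chart:
import wandb

run = wandb.init(project="prompt-playground", job_type="custom_chart")

# Raw data backing the chart
data = [[step, value] for step, value in zip(range(5), [0.1, 0.3, 0.2, 0.5, 0.4])]
table = wandb.Table(data=data, columns=["step", "value"])

# Map table columns to the fields expected by the saved Vega spec
custom_chart = wandb.plot_table(
    vega_spec_name="my-entity/my-custom-line-chart",  # placeholder spec ID
    data_table=table,
    fields={"x": "step", "y": "value"},
)
wandb.log({"custom_chart": custom_chart})
wandb.finish()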

W&B Artifacts

Artifacts enable you to track and version any serialized data as the inputs and outputs of runs. This can be datasets (e.g. image files), evaluation results (e.g. heatmaps), or model checkpoints. W&B is agnostic to the formats or structure of the data you want to log as an artifact.


Logging Artifacts

To log an artifact, you first create an Artifact object with a name, type, and optionally a description and metadata dictionary. You can then add any of these to the artifact object:
  • local files
  • local directories
  • wandb Data Types (e.g. wandb.Plotly or wandb.Table) which will render alongside the artifact in the UI
  • remote files and directories (e.g. s3 buckets)
wandb.init(project="pytorch-lightning-e2e", entity='wandb', job_type="upload_data")

# Create Artifact
training_images = wandb.Artifact(name='training_images',
                                 type="training_data",
                                 description='MNIST training data')

# Add serialized data i.e. directories, files, plots, html, W&B Tables
training_images.add_dir('./sample_images')

# Add other assets to better contextualize your artifact
training_images.add(wandb.Html('my_plotly_figure.html'), 'data_distribution_plot')

# Log to W&B, automatic versioning
wandb.log_artifact(training_images)
Each time you log this artifact, W&B checksums the file assets you add to it and compares them to previous versions of the artifact. If there is a difference, a new version is created, indicated by the aliases v0, v1, v2, etc. Users can optionally add or remove additional aliases through the UI or API. Aliases are important because they uniquely identify an artifact version, so you can use them to pull down your best model, for example.
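For example, a custom alias can be attached at logging time; the alias name here is just an illustration:
# "latest" and the version aliases (v0, v1, ...) are managed automatically;
# custom aliases such as "best" can be passed when logging.
wandb.log_artifact(training_images, aliases=["best"])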

Example lineage: the Nature_100:v0 artifact (type raw_images) was logged by a log_datasets run and consumed by several training and resume_training runs.

Consuming Artifacts

To consume an artifact, execute the following:
with wandb.init(project="pytorch-lightning-e2e", entity='wandb', job_type="model_training"):
    # Indicate we are using a dependency
    training_imgs_artifact = wandb.use_artifact("training_images:latest")
    training_images_dir = training_imgs_artifact.download()

Tracking Artifacts By Reference

You may already have large datasets sitting in a cloud object store like S3 and just want to track which versions of those datasets your Runs are using, along with any other metadata associated with them. You can do so by logging these artifacts by reference, in which case W&B only tracks the checksums and metadata of an artifact and does not copy the underlying data into W&B. Here are some more details on tracking artifacts by reference.
With artifacts you can now refer to arbitrary data assets through durable and simple names and aliases (similar to how you deal with Docker containers). This makes it really easy to hand off these assets between people and processes and see the lineage of all data, models, and results.
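Here is a minimal sketch of a reference artifact; the S3 URI is a placeholder for your own bucket and prefix:
import wandb

run = wandb.init(project="prompt-playground", job_type="upload_data")

# Reference artifact: W&B stores only checksums and metadata, not the files themselves
raw_docs = wandb.Artifact(name="raw_documents", type="dataset")
raw_docs.add_reference("s3://my-bucket/llm-eval/documents/")
run.log_artifact(raw_docs)
run.finish()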

Example: Seeing different versions of a Table

Any wandb.Table that you log is logged as an artifact of type Runs Table automatically. You can see all the tables you've logged and their respective versions. For instance, it is common to log a table periodically throughout a run:
wandb.init()

for i in range(epochs):
    wandb.log({"my_table": wandb.Table(...)})
As discussed above, the default table view in the workspace is the summary view, i.e. the last table logged in the run. If you want to see and compare other versions of the table, go to the Artifacts tab of the project and look at the Runs Table artifacts. Find your table name and click on the version you care about. If you then go to Files and click on <my_table_name>.table.json, you will see the table render.

Reports

Reports are flexible documents you can build on top of your W&B projects. You can easily embed any asset (chart, artifact, table) logged in W&B into a report alongside markdown, LaTeX, code blocks, etc. You can create rich documentation from your logged assets without copy-pasting static figures into Word docs or managing Excel spreadsheets. Reports are live: as new experiments run, they update accordingly. The report you are viewing is a good example of what you can put into them.

Programmatic Reports

It may be useful to programmatically generate a report, for example for a standard model comparison analysis that you repeat whenever you retrain models, or after a large hyperparameter search. The W&B Python SDK provides a means of programmatically generating reports under wandb.apis.reports. Check out the docs and this quickstart notebook.
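As a rough sketch (the exact block constructors may differ across SDK versions, and the project and title below are placeholders):
import wandb.apis.reports as wr

report = wr.Report(
    project="prompt-playground",
    title="Weekly evaluation summary",
    description="Comparison of the latest evaluation runs.",
)
report.blocks = [
    wr.H1(text="Results"),
    wr.P(text="Summary metrics and tables for this week's evaluation runs."),
]
report.save()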

Other Useful Resources

Import/Export API

All data logged to W&B can be accessed programmatically through the import/export API (also called the public API). This enables you to pull down run and artifact data, then filter and manipulate it however you please in Python.
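For example, a minimal sketch of pulling runs and an artifact with the public API (entity, project, and artifact names are placeholders):
import wandb

api = wandb.Api()

# Iterate over all runs in a project and inspect their config and summary
runs = api.runs("my-entity/prompt-playground")
for run in runs:
    print(run.name, run.config, run.summary)

# Download a specific artifact version by name and alias
artifact = api.artifact("my-entity/prompt-playground/training_images:latest")
artifact_dir = artifact.download()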