W&B Best Practices Guide for LLMs
Contents:
- W&B Experiments and Logging
- SDK Installation and Login
- W&B Runs
- What Can I log and How do I log it?
- Scalar Metrics
- Tables
- Table Filtering
- Table Grouping
- Creating New Columns
- Logging Tables and Metrics as Unit Evaluation
- Trace Tables
- Plots
- W&B Artifacts
- Logging Artifacts
- Consuming Artifacts
- Tracking Artifacts By Reference
- Example: Seeing different versions of a Table
- Reports
- Programmatic Reports
- Other Useful Resources
- Import/Export API
W&B Experiments and Logging
SDK Installation and Login
To start using W&B, you first need to install the Python package (if it's not already installed):
pip install wandb
Once it's installed, authenticate your user account by logging in through the CLI or SDK. You should have received an email to sign up to the platform, after which you can obtain your API token:
wandb login --host <YOUR W&B HOST URL> <YOUR API TOKEN>
Or through Python:
import os
import wandb

wandb.login(host=os.getenv("WANDB_BASE_URL"), key=os.getenv("WANDB_API_KEY"))
Once you are logged in, you are ready to track your workflows!
W&B Runs
At the core of W&B is a Run, which is a logged unit of execution of Python code. A Run captures the entire execution context of that unit: Python library versions, hardware info, system metrics, git state, etc. It is best to create a run for each granular piece of your pipeline, so it's easy to identify the inputs and outputs of a given piece of code. For example, you might have a Run for each of:
- LLM Inference on a dataset
- Creation of embeddings from some documents
- GPT4 Evaluation of LLM outputs
To create a run, call wandb.init(). There are a bunch of important arguments you can pass to wandb.init() to provide additional context for the run and enable you to organize your runs later:
import wandb

wandb.init(
    project="prompt-playground",
    entity="wandb-smle",          # Team
    job_type='evaluation',        # for organizing runs (e.g. evaluation)
    config={'temperature': 0.2},  # Hyper-parameters and other config
)

# My code and logic
# wandb logging (discussed below)

wandb.finish()  # finish the run and commit results to W&B
What Can I log and How do I log it?
Within a run context, you can log all sorts of useful info such as metrics, visualizations, charts, and interactive data tables explicitly to record inputs, outputs and results of an LLM execution. Two primary functions to be aware of are:
- wandb.log for logging time series, rich media, and tables
- wandb.summary for logging aggregate scalar metrics for a run
Scalar Metrics
Scalar metrics can be logged by passing them to wandb.log as a dictionary keyed by metric name.
for i in range(10):
    wandb.log({"my_metric": some_scalar_value})

wandb.summary["my_metric"] = 0.9
wandb.summary can be used to log scalar values as part of the run summary. Summary metrics appear in the Runs Table with a single value per run, whereas wandb.log lets you log a series of metrics, plots, or tables over the course of a run. In the LLM world, this is the difference between logging an accuracy value for each of the 10 sample documents in your evaluation run (wandb.log) and logging one average accuracy for the whole evaluation run (wandb.summary).
Each time wandb.log is called, W&B increments an internal variable called step. This is the x-axis you see on all the time-series charts. If you call wandb.log once per epoch, then the step represents the epoch count, but you may also be calling it inside validation or testing loops, in which case the meaning of step is less clear. To set the step manually, simply pass step=my_int_variable to wandb.log. This can be important for getting your charts at the resolution you want. Logging this way produces a time series of values rendered as a line plot in the workspace, which is mostly useful for model fine-tuning and training.
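For example, in an LLM evaluation run you might log a per-document metric at an explicit step and keep only the aggregate in the summary. A minimal sketch, where eval_docs and score_document are hypothetical stand-ins for your evaluation set and scoring logic:
import wandb

wandb.init(project="prompt-playground", job_type="evaluation")

scores = []
for step, doc in enumerate(eval_docs):             # eval_docs: hypothetical list of documents
    score = score_document(doc)                    # score_document: hypothetical scoring function
    scores.append(score)
    wandb.log({"doc_accuracy": score}, step=step)  # one time-series point per document

wandb.summary["avg_accuracy"] = sum(scores) / len(scores)  # single value shown in the Runs Table
wandb.finish()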
Tables
W&B Tables are interactive data frames logged to your workspace. You define the schema and what goes in them. Tables logged with the same name across runs are concatenated into one table, allowing you to see and compare rows from different evaluation or inference runs in one place!
wandb.init(project="my_project", entity="my_team", job_type="llm_evaluation")eval_table = wandb.Table(dataframe=my_pandas_df)wandb.log({"my_eval_table": eval_table})wandb.finish()
Table Filtering
There are two levels of filtering: 1) at the runs level and 2) at the table level. For instance, we could first filter to all gpt4_evaluation_runs that logged a rougeL score of at least 0.2, then filter to just those table rows where the gpt4 result was NO. You can filter a table by applying filter conditions to different column values, similar to pandas.
Table Grouping
You can group a table by the values of a column, similar to a pandas groupby. Scalar columns are aggregated into histograms for each group.
Creating New Columns
Tables are linked to the Run which generated them, meaning you can create new columns that bring in that run context, for instance the run config. We can also derive new columns from other columns, like computing the Levenshtein distance between two of the columns.
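If you prefer to precompute such a column in Python before logging, rather than deriving it in the UI, here is a minimal sketch using difflib from the standard library as a stand-in for an edit-distance metric; the DataFrame and its column names (model_output, reference_answer) are illustrative:
import difflib

import wandb

# Similarity between the model output and the reference answer, per row (illustrative columns)
my_pandas_df["output_similarity"] = [
    difflib.SequenceMatcher(None, out, ref).ratio()
    for out, ref in zip(my_pandas_df["model_output"], my_pandas_df["reference_answer"])
]

wandb.log({"my_eval_table": wandb.Table(dataframe=my_pandas_df)})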
Logging Tables and Metrics as Unit Evaluation
In traditional software development, unit testing exercises granular functionality as part of a larger system. In LLM evaluation, we are testing a non-deterministic system composed of multiple functions (e.g. a RAG pipeline). Each piece (usually a Python function) should be evaluated independently first, and the results can be logged as W&B tables and metrics. Each Run then corresponds to one "unit test" or "unit evaluation" of a given function, with its inputs and outputs on an evaluation dataset logged as a table.
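Putting that pattern together, a single "unit evaluation" run might look like the sketch below, where eval_set, retrieve, and recall_at_k are hypothetical stand-ins for your dataset, the function under evaluation, and your scoring logic:
import wandb

wandb.init(project="prompt-playground", job_type="unit_eval_retriever")

table = wandb.Table(columns=["query", "retrieved_context", "recall_at_k"])
scores = []
for query, expected in eval_set:            # eval_set: hypothetical (query, expected) pairs
    context = retrieve(query)               # retrieve: the function being unit-evaluated
    score = recall_at_k(context, expected)  # recall_at_k: hypothetical scoring function
    scores.append(score)
    table.add_data(query, context, score)

wandb.log({"retriever_unit_eval": table})
wandb.summary["avg_recall_at_k"] = sum(scores) / len(scores)
wandb.finish()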
Trace Tables
Trace tables are a special type of table which include metadata about nested function calls whose inputs and outputs are the entries in the table. Where W&B tables can be used to "unit evaluate" an application, Trace Tables are helpful for "integration evaluation" (akin to "integration testing"), when we have a complex system (e.g. a RAG pipeline) composing multiple LLM calls, tools, and databases and we need to see how everything comes together. These systems are commonly built using existing frameworks like LangChain and LlamaIndex.
W&B has integrations with both of these frameworks, which automatically track entire executions of chains or retriever pipelines as rows in the Trace Table. Each row of the Trace Table is one complete execution of an LLM chain, agent, or RAG pipeline, with the Input and Output columns holding the starting input and final output of the whole run. Clicking on a row opens the Trace Timeline, which records all the steps that happened in between, with their respective inputs and outputs.
os.environ["LANGCHAIN_WANDB_TRACING"] = "true"index = table_df[table_df["title"] == document_title].index[0]db_dir = os.path.join("chromadb", str(index))embeddings = OpenAIEmbeddings()db = Chroma(persist_directory=db_dir, embedding_function=embeddings)prompt_template = """Use the following pieces of context to answer the question.If you don't know the answer, just say that you don't know, don't try to make up an answer.Don't add your opinions or interpretations. Ensure that you complete the answer.If the question is not relevant to the context, just say that it is not relevant.CONTEXT:{context}QUESTION: {question}ANSWER:"""prompt = PromptTemplate(template=prompt_template, input_variables=["context", "question"])retriever = db.as_retriever()retriever.search_kwargs["k"] = 2qa = RetrievalQA.from_chain_type(llm=ChatOpenAI(temperature=0),chain_type="stuff",retriever=retriever,chain_type_kwargs={"prompt": prompt},return_source_documents=True)with get_openai_callback() as cb:result = qa({"query": question})answer = result["result"]
Plots
There are a variety of ways to log charts to W&B, but they all boil down to two modes:
Static: Logging a pre-built chart (e.g. matplotlib, plotly)
Logging a pre-built chart works the same as logging any rich media type. You can take a chart generated with matplotlib or plotly, optionally serialize it to an image or HTML, and log it using wandb.log:
from matplotlib.pyplot import figure
import mpld3

fig = figure()
ax = fig.gca()
ax.plot([1, 2, 3, 4])

mpld3.save_html(fig, "test.html")
wandb.log({"matplotlib_to_html": wandb.Html(open("test.html"), inject=False)})
Dynamic: Logging a chart's raw data and dynamically rendering the chart in W&B
This requires logging the raw data backing the desired chart as a wandb.Table (see below) and then using Vega to render the data graphically in the W&B UI. Fortunately, W&B provides API abstractions under wandb.plot.<plot_type> that perform these two steps automatically for common chart types, so all you have to do is use the following pattern:
# Confusion matrices
wandb.log({"conf_mat": wandb.plot.confusion_matrix(
    y_true=ground_truth, preds=predictions, class_names=class_names)})

# ROC Curves
wandb.log({"roc": wandb.plot.roc_curve(ground_truth, predictions)})

# PR Curves
wandb.log({"pr": wandb.plot.pr_curve(ground_truth, predictions)})
The benefit of dynamic charts is that they overlay chart data from multiple runs, making it easier to compare runs against each other instead of across separate plots. For a full list of supported plots, check out this page. To create plots outside this list, you will need to log the raw data and use the Custom Chart Editor to create or edit a Vega spec that renders the data how you like it.
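As an illustration of the underlying pattern (log a table of raw data, then render it), here is a minimal sketch using the built-in wandb.plot.line helper; loss_values is a hypothetical list of scalars:
# Log the raw data as a table, then render it as a line plot in the UI
data = [[step, loss] for step, loss in enumerate(loss_values)]  # loss_values: hypothetical
table = wandb.Table(data=data, columns=["step", "loss"])
wandb.log({"loss_curve": wandb.plot.line(table, "step", "loss", title="Loss over steps")})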
W&B Artifacts
Artifacts enable you to track and version any serialized data as the inputs and outputs of runs. This can be datasets (e.g. image files), evaluation results (e.g. heatmaps), or model checkpoints. W&B is agnostic to the formats or structure of the data you want to log as an artifact.

Logging Artifacts
To log an artifact, you first create an Artifact object with a name, type, and optionally a description and metadata dictionary. You can then add any of these to the artifact object:
- local files
- local directories
- wandb Data Types (e.g. wandb.Plotly or wandb.Table) which will render alongside the artifact in the UI
- remote files and directories (e.g. s3 buckets)
wandb.init(project="pytorch-lightning-e2e", entity='wandb', job_type="upload_data")# Create Artifacttraining_images = wandb.Artifact(name='training_images',type="training_data",description='MNIST training data')# Add serialized data i.e. directories, files, plots, html, W&B Tablestraining_images.add_dir('./sample_images')# Add other assets to better contextualize your artifacttraining_images.add(wandb.Html('my_plotly_figure.html'), 'data_distribution_plot')# Log to W&B, automatic versioningwandb.log_artifact(training_images)
Each time you log this artifact, W&B checksums the file assets you add to it and compares them to previous versions of the artifact. If there is a difference, a new version is created, indicated by the alias v1, v2, v3, etc. You can optionally add or remove additional aliases through the UI or API. Aliases are important because they uniquely identify an artifact version, so you can use them to, for example, pull down your best model.
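For example, you might tag a particular checkpoint when logging it and later pull it back down by that alias. A sketch; the artifact name, file path, and alias are illustrative:
model_artifact = wandb.Artifact(name="llm-finetune", type="model")
model_artifact.add_file("checkpoints/model.pt")       # illustrative local path
wandb.log_artifact(model_artifact, aliases=["best"])  # "latest" and "v<N>" are added automatically

# Later, in another run:
best_model = wandb.use_artifact("llm-finetune:best")
model_dir = best_model.download()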
Consuming Artifacts
To consume an artifact, execute the following:
with wandb.init(project="pytorch-lightning-e2e", entity='wandb', job_type="model_training"})# Indicate we are using a dependencytraining_imgs_artifact = wandb.use_artifact("training_images:latest")training_images_dir = training_imgs_artifact.download()
Tracking Artifacts By Reference
You may already have large datasets sitting in a cloud object store like S3 and just want to track which versions of those datasets your Runs use, along with any other metadata associated with them. You can do so by logging these artifacts by reference, in which case W&B only tracks the checksums and metadata of an artifact and does not copy the entire data asset to W&B. Here are some more details on tracking artifacts by reference.
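A minimal sketch of logging a reference artifact; the bucket path is illustrative:
dataset = wandb.Artifact(name="raw_documents", type="dataset")
dataset.add_reference("s3://my-bucket/documents/")  # W&B stores checksums and metadata, not the files
wandb.log_artifact(dataset)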
With artifacts you can now refer to arbitrary data assets through durable and simple names and aliases (similar to how you deal with Docker containers). This makes it really easy to hand off these assets between people and processes and see the lineage of all data, models, and results.
Example: Seeing different versions of a Table
Any wandb.Table that you log is automatically stored as an artifact of type Runs Table. You can see all the tables you've logged and their respective versions. For instance, it is common to log a table periodically throughout a run:
wandb.init()
for i in range(epochs):
    wandb.log({"my_table": wandb.Table(...)})
As discussed above, the default table view in the workspace is the summary view, i.e. the last table logged in the run. If you want to see and compare other versions of the table, go to the Artifacts tab of the project and look at the Runs Table artifacts. Find your table's name and click on the version you care about. If you then go to Files and click on <my_table_name>.table.json, you will see the table render.
Reports
Reports are flexible documents you can build on top of your W&B projects. You can easily embed any asset (chart, artifact, table) logged in W&B into a report, alongside markdown, LaTeX, code blocks, etc. You can create rich documentation from your logged assets without copy-pasting static figures into Word docs or managing Excel spreadsheets. Reports are live: as new experiments run, they update accordingly. The report you are viewing is a good example of what you can put into them.
Programmatic Reports
It may be useful to programmatically generate a report, such as for a standard model comparison you run repeatedly when retraining models, or after a large hyperparameter search. The W&B Python SDK provides a means of programmatically generating reports under wandb.apis.reports. Check out the docs and this quickstart notebook.
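A minimal sketch of the Report API; the project, title, and block contents are illustrative, and exact class names can vary between SDK versions:
import wandb.apis.reports as wr

report = wr.Report(
    project="prompt-playground",
    title="Weekly LLM Evaluation",
    description="Comparison of this week's evaluation runs.",
)
report.blocks = [
    wr.H1(text="Results"),
    wr.MarkdownBlock(text="Summary of the latest evaluation runs."),
]
report.save()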
Other Useful Resources
Import/Export API
All data logged to W&B can be accessed programmatically through the import/export API (also called the public API). This enables you to pull down run and artifact data, then filter and manipulate it however you please in Python.
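For example, here is a sketch of pulling runs into a pandas DataFrame; the entity/project path and the filter keys (which use a MongoDB-style syntax) are illustrative:
import pandas as pd
import wandb

api = wandb.Api()

# Pull runs from a project, optionally filtered (MongoDB-style filter syntax)
runs = api.runs(
    "wandb-smle/prompt-playground",
    filters={"summary_metrics.rougeL": {"$gte": 0.2}},  # illustrative filter
)

rows = []
for run in runs:
    rows.append({
        "name": run.name,
        **{k: v for k, v in run.config.items() if not k.startswith("_")},
        **run.summary._json_dict,
    })

df = pd.DataFrame(rows)
print(df.head())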