How to Compare Tables in Workspaces

Set up powerful and flexible analysis across runs that log structured data. Made by Stacey Svetlichnaya using Weights & Biases


Guide to Flexible and Fast Table Comparison

Our latest feature, W&B Tables for interactive dataset and prediction exploration, was initially designed to visualize W&B Artifacts. In the Artifacts context, Tables are typically tied to a particular version of your dataset or a fixed evaluation protocol for a specific model, and Tables associated with Artifacts are generally easy to compare via the Artifact sidebar (full guide here).
However, you can also log Tables outside of the Artifacts context: directly to a project workspace with run.log(). This run.log() route for logging Tables is easier and faster when you want to quickly explore your data or model predictions, without necessarily versioning the contents. The guide below showcases how you can organize and compare Tables in Workspaces.

Log & Explore a Single Table

There are two ways to log a Table with wandb (covered in full detail here).

run.log() to Create a Table in the Workspace

To log a Table to the workspace, construct the Table as usual (by adding rows or from a Pandas DataFrame) then call:
# add rows of local data
data = [[0, wandb.Image("0.jpg"), "0", "0"],
        [1, wandb.Image("1.jpg"), "8", "8"],
        [2, wandb.Image("2.jpg"), "4", "9"]]
columns = ["id", "image", "guess", "truth"]
my_table = wandb.Table(data=data, columns=columns)

# or initialize from an existing Pandas DataFrame
my_table = wandb.Table(dataframe=df)

# log the Table directly to a project
run.log({"my_table_key": my_table})
This will create a Table with the name "my_table_key", visible in this particular run's workspace in a dedicated Table section with the heading run.summary["my_table_key"] (check out an example run workspace).
Since a workspace shows a single run, you can't compare multiple Tables from this view, but you can explore all the standard Table interactions: sort, filter, group, and add/edit/remove columns. As with all UI configuration/visualization changes in a run workspace, these interactions are saved and applied automatically to any future matching runs. Another instance of this Table will also be logged to the project workspace, which is where comparison can be configured (see the next section).
Try these interactions in the Table below, showing predictions on the full MNIST test dataset after training a simple CNN for 1 epoch. Scroll up/down and left/right inside the Table panel to see more contents, or advance to the next page via the arrows in the top right. The score_0 to score_9 columns show the model's normalized confidence score for each class label.
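If you're curious how rows like these might be assembled, here's a minimal sketch. The softmax and prediction_row helpers are my own illustrative names, not wandb APIs; the score_0 to score_9 columns come from normalizing the model's raw outputs, and the wandb.Table call is shown as a comment:

```python
import math

def softmax(logits):
    # normalize raw model outputs into confidence scores that sum to 1
    exps = [math.exp(x - max(logits)) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def prediction_row(image_id, logits, truth):
    # one Table row: id, guess, truth, then score_0 ... score_9
    scores = softmax(logits)
    guess = str(scores.index(max(scores)))
    return [image_id, guess, str(truth)] + scores

# build rows for a toy batch, then wrap them in a wandb.Table, e.g.:
#   columns = ["id", "guess", "truth"] + [f"score_{i}" for i in range(10)]
#   table = wandb.Table(data=rows, columns=columns)
rows = [prediction_row(0, [0.1] * 9 + [5.0], 9)]
```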

Include a Logged Table in a Report

There are two ways to add a Table like the one above to a Report:

Helpful Tips

The run workspace view is stateful and shared for all Tables under the same key

Any interactions you apply in the run workspace will persist across refreshes and, crucially, will propagate to any other run workspace view of a Table logged under the name "my_table_key". You can always click "Reset & Automate Columns" to return to the default state.

Unique IDs are optional

You can choose whether to log unique ids in each row (see the last section of this report for an example). If unique example identifiers are meaningful in your use case (e.g. the file names of each image in the training data), this can help map from specific images to the corresponding files for correcting labels, storing or sharing new dataset versions, etc. Creating unique ids for specific cases may also help you configure the right comparison across models: align on a combination of settings and prompt text, as in this example of text Tables, or ensure you're fairly evaluating GAN images from the same training step/matching stages across multiple model versions. If unique ids are irrelevant or tricky to access (e.g. a quick exploration), Tables can use the hash of an image as a unique identifier on which to join.
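As a concrete sketch of one possible id scheme (the one I describe at the end of this report), ids could be derived from batch and within-batch position. The make_ids helper is just an illustrative name, not a wandb API:

```python
def make_ids(num_batches, batch_size):
    # stable unique id per test example: "<batch index>_<index within batch>"
    return [f"{b}_{i}" for b in range(num_batches) for i in range(batch_size)]

# each id can then be logged as the first column of every Table row,
# giving a consistent join key across runs
ids = make_ids(num_batches=2, batch_size=3)
```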

Save class labels as strings

For best results, cast any class labels to strings before logging them to a Table (integer labels would be a very reasonable default for the MNIST digits, as in the example above). Grouping by class label is a common operation in Tables to visualize model precision, recall, specific false positive/negative confusion, etc. In a "group by" operation, strings will appear as y-axis labels in histograms, while integers will instead appear as x-axis tick marks, which are much harder to read at a glance.
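Here's a minimal sketch of that casting step; the class_names list is an assumed example for a non-digit dataset, not anything wandb requires:

```python
# integer predictions straight from argmax
int_labels = [3, 8, 8, 0]

# cast to strings before logging so "group by" renders readable labels
str_labels = [str(label) for label in int_labels]

# for non-digit datasets, mapping through human-readable class names
# (an assumed list, not part of wandb) is even clearer
class_names = ["airplane", "car", "bird", "cat", "deer",
               "dog", "frog", "horse", "ship", "truck"]
named_labels = [class_names[label] for label in int_labels]
```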

Enable a clean single-run Table view with a run name column

By default, Table panels are designed for multiple runs. If you'd like to visualize a single run in a Table panel, click on the gear icon in the top right corner and change the "Merge By" strategy from "Joining" to "Concatenating". This will simplify the column formatting and add a leftmost column with the run's name, which you can then delete. We'll cover the merge strategy in more detail in the following section.

Compare Two or More Tables

Default view: The Panel is a list of Tables, with rows joined across runs

When multiple runs log a Table to the same key, their default comparison view looks like this. Individual Tables logged by each run are accessible via numerical index, or their order in the visible run set from top to bottom (so, row["guess"][0] is the first visible run's guesses, row["guess"][1] is the second visible run's guesses, etc). Here, two runs—a model variant called "double" in purple, indexed "0", and the "baseline" model in green, indexed "1"—both train for 1 epoch and log test predictions on 10K images to a Table descriptively named "test_images_10K_E1". Both models are toy convnets trained on MNIST in PyTorch (interactive Colab here), and the "double" variant multiplies both layer sizes and the learning rate by two.
The rows are joined across the two Table instances based on the image file hash. In each row of the Table, for each run, we see the model's prediction ("guess" column), the ground truth (identical across all runs), and the logits/confidence scores for each class, compared across model variants as histograms.

Helpful Tips

Panels render all the visible runs which logged a Table to the specified key

In this Report, you can try adding more model variants to any panel by toggling the blue "eyes" in the run set tab below the panel. For example, in the chart above showing 4 "CNN Variants", you can toggle the "eye" icons to show the blue and/or peach runs—or hide any runs already visible. This will update the color bars in the Table, letting you compare multiple models' predictions across the same images.
Any Table panel will try to render all the runs which are currently visible. Note that some of the visible runs may not have logged a Table instance to the specified key. All visible runs which logged a Table to the specified key will render in the Table panel with numeric indexes, starting from 0 for the first visible run, 1 for the second, etc, from top to bottom in the visible run set. You can use these indexes to refer to the specific runs when editing expressions in a column or the filter: for example, row["guess"][0] refers to the first visible run's guesses. If a visible run didn't log a Table for the chosen key ("test_images_10K_E1" above), it will not show up in the Table panel.

Controlling the set of visible runs/Tables

The visible runs have a blue/active "eye" icon to the left of the run name, and this run set appears in a tab below the panel.
You can control which runs are rendered at any given time via the "eye" icons. Note that it's much easier to organize, manage, and save different Table views in Reports, as you can change run sets across sections, whereas the Workspace has just one set of visible runs, shown in the sidebar. The Workspace will always save your most recent settings and is best used as a scratchpad for quick exploration.

Organize Tables with clear descriptive names

You can only compare Tables if they share a name (are logged under the same key). You cannot edit this key after logging, though you can always log a new instance. For now, the bidirectional mapping of run names to Table names is implicit, so you can't see the full set of runs which have logged Tables to a particular key, and you can't easily see the set of Table names logged by a particular run, except by scrolling through that run's workspace.
To keep Tables organized:

Merge Strategy: Join on a Particular Field, Concatenate to Stack/Append

You can change the operation used to merge rows across the runs via the gear icon in the top right corner of a Table panel:
Joining on a key across multiple runs creates a single column for the shared join key and additional columns for each of the columns in each of the runs. Outer vs inner join behaves intuitively across this join key. Note that this enables arbitrary joins—you may not want to launch a join on a column of floats across 100K rows :)
The alternative strategy to joining is concatenating: stacking all the Tables together into one long list of rows. In this view, columns refer to the union across all runs visible in the Table, and we can no longer index into individual runs (e.g. row["guess"][0] vs row["guess"][1] is no longer meaningful; there is a single row["guess"] column which contains all of model 0's and model 1's values).
Note that all the available columns are listed as options for Join Key, even if they aren't relevant/reasonable. For example, joining on confidence scores for a particular class across thousands of images is a feasible, but likely useless, query.
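If you think in Pandas, the two merge strategies behave roughly like DataFrame.merge vs. concat. This analogy is only a sketch of the semantics, not W&B's actual implementation:

```python
import pandas as pd

# two runs' Tables sharing an "id" join key
run0 = pd.DataFrame({"id": ["a", "b"], "guess": ["3", "8"]})
run1 = pd.DataFrame({"id": ["a", "b"], "guess": ["3", "9"]})

# "Joining": one row per join key, one guess column per run
joined = run0.merge(run1, on="id", suffixes=("_0", "_1"))

# "Concatenating": stack all rows into one long list;
# a single guess column, with run identity flattened away
concatenated = pd.concat([run0, run1], ignore_index=True)
```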

Compare Across Examples by Content: Sort by Column

How can we compare performance across multiple models using a new metric? Starting from the default comparison view, with the "Merge By" strategy set to "Joining":
Try adding your own derived columns and sorting by their values below. Remember you can also filter and change the run set.
Note: I edited the column order below so that "score_2" appears right next to the derived "diff score 2" column (three-dot header menu > choose a different column name).

Aggregate Precision: Group Joined view by Truth, Filter for Errors

Starting from a default Table comparing two or more runs, look at aggregate model precision or false negatives: given the true labels, what did the models guess? Are there any patterns across the errors or model variants?
Below, scroll right to see comparative score distributions for all the classes. You can also toggle run visibility to compare more models.

Aggregate Recall: Group Concatenated View by Guess, Filter for Errors

Starting from a default Table comparing two or more runs, look at aggregate model recall or false positives: given the models' predictions, what were the actual true labels? Are there any patterns across the variants?
Below, scroll right to see comparative score distributions for all the classes. You can also toggle run visibility to explore other models (from the "CNN Variants" run set menu). Note that the "image" column may contain many copies of the same image—since we've concatenated the Tables across models, we've flattened all the rows and are no longer tracking which image came from which Table. We can still use the truth and score columns to understand patterns in the mistakes. In this example, looking at recall/false positives, you can see the purple "double" model outperforms the green "baseline" model on 0s, 1s, 2s (substantially more green than purple in the histograms), is comparable on 5s, 6s, 7s, 9s (similar bar areas for green and purple), and does worse on 3s and 8s (more purple than green).

Tandem Comparison: Side-by-Side View of Matched Tables Across Independent Runs

You can also compare multiple models side-by-side—this is my favorite view.
You can change the number of vertical sections by toggling visibility in the "Double vs Baseline" run set tab below the panel. Scroll right in each vertical section to view the label and score distributions for the corresponding model.
Any Table operations like sorting, filtering, and grouping will apply to all runs in tandem, but the images/examples associated with different models (predictions, generated samples, etc) will be clearly separated. This high-level view enables quick comparisons and insights. As we saw in the previous section on aggregate recall, the purple "double" variant (left) generally outperforms the green "baseline" (right). In this view, we can easily tell which model gets which images wrong, and see exact counts for each group. Here we see that "double" performs much worse than the baseline on the digits 3, 8, and 9. We can browse more examples, or create a new view of the Table focusing specifically on confusion across these classes.

Generic Comparison: Side-by-side view of Independent Tables Across Matched Runs

You can also view Tables side-by-side in the same section, with the same set of visible runs, without synchronizing operations across the individual Table panels. In the example below, I've added two completely independent Table panels to the same panel grid or report section. You can accomplish this by:
Here I am comparing model predictions across the "baseline" and "double" models after 1 epoch of training (left panel, Table key "test_images_10K_E1") and after 5 epochs of training (right panel, Table key "test_images_10K_E5"). Any sorting, filtering, grouping, or other analysis I do in one Table will not be matched in the other Table, so I can configure completely independent views and still align them visually. Below, I'm showing the images with the most disagreement between the two models on the digit being a 3 (left) versus a 9 (right).

Flexible intersection of run sets and Table names

4 runs are visible in this run set: "baseline" and "double", which only logged Tables to the "test_images_10K_E1" key, and "baseline E5" and "double E5", which only logged Tables to the "test_images_10K_E5" key. The two Table panels below thus have no overlap and are totally independent of each other, but this doesn't need to be the case. The same run could log to multiple Table keys and appear on both sides. Overall, this panel grid will render the intersection between all the Tables logged by the visible runs and all the Tables matching the Table key specified at the top of each panel.

P.S. Compare Across Examples by Unique ID

Below I've logged two Tables with an explicit "id" column and used it to join, then grouped by truth to evaluate precision across test predictions from the now-familiar "baseline" and "double" models. My ids are generated from the sequential numbering of the test image batches, then the numbering of the images within each batch.
This join-on-ids mode is especially useful to compare images, audio, or other generated output from different models—here you'll see that the images are identical across ids (since they're the test dataset of MNIST). Note that the guess and truth columns show histograms instead of numbers—this is because I logged my class labels as ints, which is why I recommend logging them as easier-to-read strings.
In the grouped view, I've removed the image column so you can see the contrast in prediction distributions right away: "double" outperforms "baseline" much more obviously in this comparison, based on the near-perfect peaks of correct guesses for purple compared to the broad, roughly uniform distribution of green in the "guess" column. To me, this suggests some difference between the models beyond the variants compared in the rest of the project; perhaps this is simply an unfair comparison. As you can see from the run set below the Tables, the "double" model actually trains with a batch size of 64, while the "baseline" uses a batch size of 32.