How to Compare Tables in Workspaces
Set up powerful and flexible analysis across runs logging structured data
Guide to Flexible and Fast Table Comparison
Our latest feature, W&B Tables for interactive dataset and prediction exploration, was initially designed to visualize W&B Artifacts. In the Artifacts context, Tables are typically tied to a particular version of your dataset or a fixed evaluation protocol for a specific model, and Tables associated with Artifacts are generally easy to compare via the Artifact sidebar (full guide here).
However, you can also log Tables outside of the Artifacts context: directly to a project workspace with run.log(). This run.log() route for logging Tables is easier and faster when you want to quickly explore your data or model predictions, without necessarily versioning the contents. The guide below showcases how you can organize and compare Tables in Workspaces.
Log & Explore a Single Table
- run.log(): associate a Table with a particular experiment run. This is the recommended default and most useful for saving a model's predictions or samples generated during training or testing. It's ideal for faster exploration, and we'll focus on this method in this report.
- artifact.add(): associate a Table with a particular dataset or model version. This is most useful when you plan to reference/reuse a specific dataset version in future runs. Refer to this section of the docs for Table comparison from the Artifacts context.
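For reference, here is a minimal sketch of the Artifacts route (the project, Artifact, and Table names are placeholders, not the exact ones used elsewhere in this report):

```python
import wandb

run = wandb.init(project="mnist-tables")  # hypothetical project name

# a tiny Table; see the next section for building richer Tables
my_table = wandb.Table(columns=["id", "guess", "truth"], data=[[0, "8", "8"]])

# version the Table inside an Artifact instead of logging it to the workspace
eval_artifact = wandb.Artifact("test_results", type="evaluation")
eval_artifact.add(my_table, "test_predictions")  # name of the Table within the Artifact
run.log_artifact(eval_artifact)
```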
Run.log() to Create a Table in the Workspace
To log a Table to the workspace, construct the Table as usual (by adding rows or from a Pandas DataFrame) then call:
```python
# add rows of local data
data = [
    [0, wandb.Image("0.jpg"), "0", "0"],
    [1, wandb.Image("1.jpg"), "8", "8"],
    [2, wandb.Image("2.jpg"), "4", "9"],
]
columns = ["id", "image", "guess", "truth"]
my_table = wandb.Table(data=data, columns=columns)

# or initialize from an existing Pandas DataFrame
my_table = wandb.Table(dataframe=df)

# log the Table directly to a project workspace
wandb.run.log({"my_table_key": my_table})
```
This will create a Table with the name "my_table_key", visible in this particular run's workspace in a dedicated Table section with the heading run.summary["my_table_key"] (check out an example run workspace).
Since a workspace is for a single run, you can't compare multiple Tables from this single Table view, but you can explore all the standard Table interactions like sort, filter, group, add/edit/remove columns. As with all UI configuration/visualization changes in a run workspace, these interactions are saved and applied automatically to any future matching runs. Another instance of this Table will be logged to the project workspace—this is where comparison can be configured (see next section).
Try these interactions in the Table below, showing predictions on the full MNIST test dataset after training a simple CNN for 1 epoch. Scroll up/down and left/right inside the Table panel to see more contents, or advance to the next page via the arrows in the top right. The score_0 to score_9 columns show the model's normalized confidence score for each class label.
[Table panel: run set "Baseline CNN, 1 epoch" (1)]
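As a rough sketch of how a predictions Table like the one above might be assembled (assuming a trained PyTorch `model` and a `test_loader` over the MNIST test set; the column names match the panel, everything else is illustrative):

```python
import torch
import wandb

# assumed to exist: a trained `model` and a `test_loader` over the MNIST test set
columns = ["image", "guess", "truth"] + [f"score_{c}" for c in range(10)]
test_table = wandb.Table(columns=columns)

with torch.no_grad():
    for images, labels in test_loader:
        scores = torch.softmax(model(images), dim=1)  # normalized confidence per class
        for img, label, s in zip(images, labels, scores):
            guess = str(s.argmax().item())
            test_table.add_data(wandb.Image(img), guess, str(label.item()), *s.tolist())

wandb.run.log({"test_images_10K_E1": test_table})
```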
Include a Logged Table in a Report
There are two ways to add a Table like the one above to a Report:
- export a Table from your workspace: from the run workspace or main project workspace, click on the three dots in the upper right of a Table then "Add to Report". You can now choose an existing Report and section, or create a new Report
- add a Weave panel: from the Report editing view, add a Panel Grid (by typing "/Panel grid"), then add a "Weave" visualization. Type runs.summary["your_table_key"] in the top search bar of the Weave panel (autocompletions should help here).
Helpful Tips
The run workspace view is stateful and shared for all Tables under the same key
Any interactions you apply in the run workspace will persist across refreshes and, crucially, will propagate to any other run workspace view of a Table logged under the name "my_table_key". You can always click "Reset & Automate Columns" to return to the default state.
Unique IDs are optional
You can choose whether to log unique ids in each row (see the last section of this report for an example). If unique example identifiers are meaningful in your use case (e.g. the file names of each image in the training data), this can help map from specific images to the corresponding files for correcting labels, storing or sharing new dataset versions, etc. Creating unique ids for specific cases may also help you configure the right comparison across models: align on a combination of settings and prompt text, as in this example of text Tables, or ensure you're fairly evaluating GAN images from the same training step/matching stages across multiple model versions. If unique ids are irrelevant or tricky to access (e.g. a quick exploration), Tables can use the hash of an image as a unique identifier on which to join.
Save class labels as strings
For best results, cast any class labels to strings before logging them to a Table, even when integers would be the natural default (as for the MNIST digits in the example above). Grouping by class label is a common operation in Tables to visualize model precision, recall, specific false positive/negative confusion, etc. In a "group by" operation, strings will appear as y-axis labels in histograms, while integers will instead appear as x-axis tick marks, which are much harder to read at a glance.
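If you're building the Table from a DataFrame, a quick (hypothetical) way to do this is to cast the label columns up front:

```python
import pandas as pd
import wandb

# inside an active wandb run: cast the label columns to strings before logging
df = pd.DataFrame({"guess": [0, 8, 4], "truth": [0, 8, 9]})  # illustrative data
df[["guess", "truth"]] = df[["guess", "truth"]].astype(str)
wandb.run.log({"labels_as_strings": wandb.Table(dataframe=df)})
```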
Enable a clean single-run Table view with a run name column
By default, Table panels are designed for multiple runs. If you'd like to visualize a single run in a Table panel, click on the gear icon in the top right corner and change the "Merge By" strategy from "Joining" to "Concatenating". This will simplify the formatting of the columns and add a leftmost column with the run's name, which you can delete. We'll learn more about the merge strategy in the following section.
Compare Two or More Tables
Default view: The Panel is a list of Tables, with rows joined across runs
When multiple runs log a Table to the same key, their default comparison view looks like this. Individual Tables logged by each run are accessible via numerical index, or their order in the visible run set from top to bottom (so row["guess"][0] is the first visible run's guesses, row["guess"][1] is the second visible run's guesses, etc). Here, two runs—a model variant called "double" in purple, indexed "0", and the "baseline" model in green, indexed "1"—both train for 1 epoch and log test predictions on 10K images to a Table descriptively named "test_images_10K_E1". Both models are toy convnets trained on MNIST in PyTorch (interactive Colab here), and the "double" variant multiplies both layer sizes and the learning rate by two.
The rows are joined across the two Table instances based on the image file hash. In each row of the Table, for each run, we see the model's prediction ("guess" column), the ground truth (identical across all runs), and the logits/confidence scores for each class, compared across model variants as histograms.
[Table panel: run set "CNN Variants" (2)]
Helpful Tips
Panels render all the visible runs which logged a Table to the specified key
In this Report, you can try adding more model variants to any panel by toggling the blue "eyes" in the run set tab below the panel. For example, in the chart above showing 4 "CNN Variants", you can toggle the "eye" icons to show the blue and/or peach runs—or hide any runs already visible. This will update the color bars in the Table, letting you compare multiple models' predictions across the same images.
Any Table panel will try to render all the runs which are currently visible. Note that some of the visible runs may not have logged a Table instance to the specified key. All visible runs which logged a Table to the specified key will render in the Table panel with numeric indexes, starting from 0 for the first visible run, 1 for the second, etc, from top to bottom in the visible run set. You can use these indexes to refer to the specific runs when editing expressions in a column or the filter: for example, row["guess"][0] refers to the first visible run's guesses. If a visible run didn't log a Table for the chosen key ("test_images_10K_E1" above), it will not show up in the Table panel.
Controlling the set of visible runs/Tables
The visible runs have a blue/active "eye" icon to the left of the run name, and this run set appears
- below each Panel Grid in a Report (fixed for all panels in each Panel Grid, or one set per Report section)
- in the left side bar in a Workspace (fixed for all sections in the Workspace, or one set per entire Workspace)
You can control which runs are rendered at any given time via the "eye" icons. Note that it's much easier to organize, manage, and save different Table views in Reports, as you can change run sets across sections, whereas the Workspace has just one set of visible runs, shown in the sidebar. The Workspace will always save your most recent settings and is best used as a scratchpad for quick exploration.
Organize Tables with clear descriptive names
You can only compare Tables if they share a name (are logged under the same key). You cannot edit this key after logging, though you can always log a new instance. For now, the mapping between runs and Table keys is implicit in both directions: you can't see the full set of runs which have logged Tables to a particular key, and you can't easily see the set of Table keys logged by a particular run, except by scrolling through that run's workspace.
To keep Tables organized:
- give all Tables clear descriptive names, perhaps including the type of data logged, the expected number of samples, the workflow stage, etc.
- annotate runs to indicate which Tables they logged, using descriptive run names or custom tags, both of which can be set programmatically or modified afterwards
- use other configuration settings like the optional "job_type" or "group" arguments to wandb.init() to organize visible run sets, e.g. keep exploratory analysis scripts—and the associated Table views—separate from training runs or evaluation runs—and any associated Tables of predictions, validation results, etc.
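A minimal sketch of what this bookkeeping might look like at init time (every name and tag below is a placeholder):

```python
import wandb

run = wandb.init(
    project="mnist-tables",            # hypothetical project
    name="baseline_cnn_E1",            # descriptive run name
    job_type="evaluation",             # keep eval runs separate from training runs
    group="cnn_variants",              # group related model variants
    tags=["logs:test_images_10K_E1"],  # note which Table key(s) this run logs
)
run.log({"test_images_10K_E1": my_table})  # my_table built as shown earlier
```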
Merge Strategy: Join on a Particular Field, Concatenate to Stack/Append
You can change the operation used to merge rows across the runs via the gear icon in the top right corner of a Table panel:

Joining on a key across multiple runs creates a single column for the shared join key and additional columns for each of the columns in each of the runs. Outer vs inner join behaves intuitively across this join key. Note that this enables arbitrary joins—you may not want to launch a join on a column of floats across 100K rows :)
The alternative strategy to joining is concatenating: stacking all the Tables together into one long list of rows. In this view, columns refer to the union across all runs visible in the Table, and we can no longer index into individual runs (e.g. row["guess"][0] vs row["guess"][1] is no longer meaningful, there is one row["guess"] column which contains all of model 0 and model 1's values).
Note that all the available columns are listed as options for Join Key, even if they aren't relevant/reasonable. For example, joining on confidence scores for a particular class across thousands of images is a feasible, but likely useless, query.
Compare Across Examples by Content: Sort by Column
How can we compare performance across multiple models using a new metric? Starting from the default comparison view, with the "Merge By" strategy set to "Joining":
- insert a column: hover over the three-dot menu in any column header and insert a column to the left or right
- compute a new metric: use any combination of existing column names, mathematical expressions, and some aggregation keywords (sum, avg, count) to compose a new metric. Below, I've added the column "diff score 2" with the expression row["score_2"][0]-row["score_2"][1]. This computes the difference between the confidence scores for digit=2 between the first two models visible in this panel. I can sort by this column to see the images on which the two models maximally differ (or maximally agree).
Try adding your own derived columns and sorting by their values below. Remember you can also filter and change the run set.
Note: I edited the column order below so that "score_2" appears right next to the derived "diff score 2" column (three-dot header menu > choose a different column name).
[Table panel: run set "CNN Variants" (2)]
Aggregate Precision: Group Joined view by Truth, Filter for Errors
Starting from a default Table comparing two or more runs, look at aggregate model precision or false negatives: given the true labels, what did the models guess? Are there any patterns across the errors or model variants?
- group by truth: hover over the "truth" column header and click on the column name or the three dots to the far right. Select "group by".
- edit grouped column to show a single class label: from "Column settings" via the column name or three dot menu, edit the cell expression from row["truth"] to row["truth"][0]. The true label is identical across model variants, and this formats the label as a single string instead of an array of identical true labels for each model variant.
- sort by truth[0] to view a consistent ordering of labels
- focus on errors (optional): since this model is very accurate, it can be helpful to zoom in on the incorrect predictions. Use the filter expression to select only for wrong guesses for each model: row["guess"][0] != row["truth"][0] and row["guess"][1] != row["truth"][1]. If you add more than two runs to this panel, you'll need to modify the filter expression to account for each run.
Below, scroll right to see comparative score distributions for all the classes. You can also toggle run visibility to compare more models.
[Table panel: run set "CNN Variants" (2)]
Aggregate Recall: Group Concatenated View by Guess, Filter for Errors
Starting from a default Table comparing two or more runs, look at aggregate model recall or false positives: given the models' predictions, what were the actual true labels? Are there any patterns across the variants?
- switch merge strategy to "Concatenating": from the gear icon in the top right, change the "Merge By" setting to "Concatenating" to flatten across all the rows (or each Table) from the runs/models you're comparing
- group by guess: hover over the "guess" column header, click on the column name or the three dots to the far right and select "group by"
- sort by guess (optional) to get a stable ordering of labels
- focus on errors (optional): since these MNIST models are very accurate, it can be helpful to zoom in on the incorrect predictions. Use the filter expression to show only the wrong guesses for each model: row["guess"] != row["truth"]. Note that in the concatenated view, we don't need to index into the different models, because all their predictions are flattened into one big list.
Below, scroll right to see comparative score distributions for all the classes. You can also toggle run visibility to explore other models (from the "CNN Variants" run set menu). Note that the "image" column may contain many copies of the same image—since we've concatenated the Tables across models, we've flattened all the rows and are no longer tracking which image came from which Table. We can still use the truth and score columns to understand patterns in the mistakes. In this example, looking at recall/false positives, you can see the purple "double" model outperforms the green "baseline" model on 0s, 1s, 2s (substantially more green than purple in the histograms), is comparable on 5s, 6s, 7s, 9s (similar bar areas for green and purple), and does worse on 3s and 8s (more purple than green).
[Table panel: run set "CNN Variants" (2)]
Tandem Comparison: Side-by-Side View of Matched Tables Across Independent Runs
You can also compare multiple models side-by-side—this is my favorite view.
- merge via "Concatenate"
- fetch the table.rows: modify the query at the top of the panel, adding ".table.rows" after the desired Table key (the full expression becomes runs.summary["test_images_10K_E1"].table.rows, as shown below). The display mode in the top right corner should now say "Row.Table", and you will see multiple vertical sections of your Tables in one panel, one section for each visible run. I recommend keeping this panel to 2-3 runs for legibility.
- apply matched Table operations (optional): any Table operations you apply to one Table will now apply to all the Tables in the panel in tandem. Below, I filter for incorrect guesses (row["guess"] != row["truth"]), group by "guess", sort by the grouped guess value, and increase the image column to the maximum size of 10 to get an overview of the kinds of incorrect predictions in the new "double" model (left) versus my "baseline" (right). Note that you can tell which model is which based on the color of the histogram bars (sleek UI improvements for this coming soon!).
You can change the number of vertical sections by toggling visibility in the "Double vs Baseline" run set tab below the panel. Scroll right in each vertical section to view the label and score distributions for the corresponding model.
Any Table operations like sorting, filtering, and grouping will apply to all runs in tandem, but the images/examples associated with different models (predictions, generated samples, etc) will be clearly separated. This high-level view enables quick comparisons and insights. As we saw in the previous section on aggregate recall, the purple "double" variant (left) generally outperforms the green "baseline" (right). In this view, we can easily tell which model gets which images wrong, and see exact counts for each group. Here we see that "double" performs much worse than the baseline on the digits 3, 8, and 9. We can browse more examples, or create a new view of the Table focusing specifically on confusion across these classes.
[Table panel: run set "Double vs Baseline" (2)]
Generic Comparison: Side-by-side view of Independent Tables Across Matched Runs
You can also view Tables side-by-side in the same section, with the same set of visible runs, without synchronizing operations across the individual Table panels. In the example below, I've added two completely independent Table panels to the same panel grid or report section. You can accomplish this by:
- exporting panels from workspaces to the same Report section, possibly dragging and dropping and resizing them to get the arrangement you want
- using "Add panel" => "Weave" => typing runs.summary["your_table_key_here"] to create a new Table panel
Here I am comparing model predictions across the "baseline" and "double" models after 1 epoch of training (left panel, Table key "test_images_10K_E1") and after 5 epochs of training (right panel, Table key "test_images_10K_E5"). Any sorting, filtering, grouping, or other analysis I do in one Table will not be matched in the other Table, so I can configure completely independent views and still align them visually. Below, I'm showing the images with the most disagreement between the two models on the digit being a 3 (left) versus a 9 (right).
Flexible intersection of run sets and Table names
4 runs are visible in this run set: "baseline" and "double", which only logged Tables to the "test_images_10K_E1" key, and "baseline E5" and "double E5", which only logged Tables to the "test_images_10K_E5" key. The two Table panels below thus have no overlap and are totally independent of each other, but this doesn't need to be the case. The same run could log to multiple Table keys and appear on both sides. Overall, this panel grid will render the intersection between all the Tables logged by the visible runs and all the Tables matching the Table key specified at the top of each panel.
[Table panel: run set "CNN Variants" (4)]
P.S. Compare Across Examples by Unique ID
Below I've logged two tables with an explicit "id" column and used it to join, then group by truth to evaluate precision across test predictions from the now-familiar "baseline" and "double" models. My ids are generated from the sequential numbering of the test image batches, then the numbering of the images in each batch.
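A hedged sketch of that id scheme (the data loader, model, and Table key below are assumptions, not the exact code used for these runs):

```python
import torch
import wandb

id_table = wandb.Table(columns=["id", "image", "guess", "truth"])

with torch.no_grad():
    for batch_idx, (images, labels) in enumerate(test_loader):  # test_loader is assumed
        guesses = model(images).argmax(dim=1)                    # model is assumed
        for i, (img, guess, label) in enumerate(zip(images, guesses, labels)):
            example_id = f"{batch_idx}_{i}"  # batch number, then position within the batch
            # note: ints here, which is why the guess/truth columns below render as histograms
            id_table.add_data(example_id, wandb.Image(img), guess.item(), label.item())

wandb.run.log({"test_predictions_with_ids": id_table})  # placeholder Table key
```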
This join-on-ids mode is especially useful to compare images, audio, or other generated output from different models—here you'll see that the images are identical across ids (since they're the test dataset of MNIST). Note that the guess and truth columns show histograms instead of numbers—this is because I logged these class labels as ints, and it's why I recommend logging them as easier-to-read strings.
In the grouped view, I've removed this image column so you can see the contrast in prediction distributions right away: "double" outperforms "baseline" much more obviously in this comparison, based on the near-perfect peaks of correct guesses for purple compared to the broad, equally-likely distribution of green in the "guess" column. To me, this suggests some other difference between the models, relative to how I ran them in the rest of the project—perhaps most of the issue is due to this being an unfair comparison. As you can see from the run set below the Tables, the "double" model actually trains with a batch size of 64, while the "baseline" has a batch size of 32.
[Table panel: run set "Double vs. Baseline with ids" (2)]