Test Exploration For Table Joins
Set up powerful and flexible analysis across runs that log structured data
A Colab that may help reproduce these results: wandb.me/dsviz-mnist-colab
[P0] Workspace with UI table joins on image/id doesn't work anymore
[P1] Join on id in Report
[Below this is WIP]
Compare Two or More Tables
Default view: The Panel is a list of Tables, with rows joined across runs
When multiple runs log a Table to the same key, their default comparison view looks like this. The individual Tables logged by each run are accessible via a numerical index corresponding to their order in the visible run set from top to bottom (so row["guess"][0] is the first visible run's guesses, row["guess"][1] is the second visible run's guesses, etc.). Here, two runs (a model variant called "double" in purple, indexed "0", and the "baseline" model in green, indexed "1") both train for 1 epoch and log test predictions on 10K images to a Table descriptively named "test_images_10K_E1". Both models are toy convnets trained on MNIST in PyTorch (see the Colab linked above), and the "double" variant multiplies both layer sizes and the learning rate by two.
The rows are joined across the two Table instances based on the image file hash. In each row of the Table, for each run, we see the model's prediction ("guess" column), the ground truth (identical across all runs), and the logits/confidence scores for each class, compared across model variants as histograms.
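For reference, here's a rough sketch of how such a setup might be logged (the project name and the predictions() helper below are assumptions, not the actual Colab code): each run writes its test predictions to a wandb.Table under the same key, which is what produces the joined comparison view.

```python
import wandb

for variant in ["baseline", "double"]:          # the two model variants compared above
    run = wandb.init(project="dsviz-mnist", name=variant)  # project name is illustrative

    columns = ["image", "guess", "truth"] + [f"score_{d}" for d in range(10)]
    table = wandb.Table(columns=columns)

    # predictions(variant) is a hypothetical helper yielding
    # (image, predicted label, true label, per-class scores) for each test example
    for img, guess, truth, scores in predictions(variant):
        table.add_data(wandb.Image(img), guess, truth, *scores)

    # Logging to the same key in every run is what lets the default
    # comparison view join the Tables across runs
    run.log({"test_images_10K_E1": table})
    run.finish()
```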
Helpful Tips
Merge Strategy: Join on a Particular Field, Concatenate to Stack/Append
Compare Across Examples by Content: Sort by Column
Aggregate Precision: Group Joined View by Truth, Filter for Errors
Aggregate Recall: Group Concatenated View by Guess, Filter for Errors
Tandem Comparison: Side-by-Side View of Matched Tables Across Independent Runs
Generic Comparison: Side-by-Side View of Independent Tables Across Matched Runs
P.S. Compare Across Examples by Unique ID
Below I've logged two Tables with an explicit "id" column, joined on that column, and then grouped by truth to evaluate precision across test predictions from the now-familiar "baseline" and "double" models. The ids are generated from the sequential numbering of the test image batches, then the numbering of the images within each batch.
This join-on-id mode is especially useful for comparing images, audio, or other generated output from different models. Here you'll see that the images are identical across ids (since they're the MNIST test dataset). Note that the guess and truth columns show histograms instead of numbers: this is because I logged my class labels as ints, which is why I recommend logging them as easier-to-read strings instead.
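Here's a minimal sketch of that logging, under some assumptions (illustrative project and key names, and a hypothetical test_loader and model standing in for the MNIST test DataLoader and the trained convnet). It builds the id from the batch number and the image's position within the batch, and casts labels to strings as recommended above.

```python
import torch
import wandb

run = wandb.init(project="dsviz-mnist", name="baseline")   # illustrative project/run names
table = wandb.Table(columns=["id", "image", "guess", "truth"])

# test_loader and model are assumed: the MNIST test DataLoader and a trained convnet
with torch.no_grad():
    for batch_idx, (images, labels) in enumerate(test_loader):
        guesses = model(images).argmax(dim=1)
        for i in range(len(images)):
            img_id = f"{batch_idx}_{i}"   # sequential batch number, then index within the batch
            table.add_data(img_id,
                           wandb.Image(images[i]),
                           str(guesses[i].item()),   # strings stay readable labels in the UI
                           str(labels[i].item()))

run.log({"test_predictions_by_id": table})   # key name is illustrative
run.finish()
```

With both runs logging a Table like this, the merge strategy described above can join on the "id" field instead of the image.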
In the grouped view, I've removed the image column so you can see the contrast in prediction distributions right away: "double" outperforms "baseline" much more obviously in this comparison, based on the near-perfect peaks of correct guesses for purple compared to the broad, nearly uniform distribution of green in the "guess" column. To me, this suggests some other difference between these models relative to how I ran them in the rest of the project; perhaps most of the gap comes from this being an unfair comparison. As you can see from the run set below the Tables, the "double" model actually trains with a batch size of 64, while the "baseline" uses a batch size of 32.