Introduction

This is a walkthrough of dataset and prediction visualization using Tables and Artifacts for image classification on W&B. Specifically, we'll finetune a convnet in Keras on photos from iNaturalist 2017 to identify 10 classes of living things (plants, insects, birds, etc). This is a tiny glimpse of how Tables can facilitate deep exploration and understanding of your data, models, predictions, and experiment progress, and you can find a variety of other examples here.

Interact with any Table

All the Tables in this report are fully interactive for you to explore (detailed instructions here). Filter the rows using the expression editor in the top left. Sort and group by the values in a column: hover over the header, and click on the three-dot menu on the right to select an action. The "Reset & Automate Columns" button will return each Table to its default state as originally logged, and refreshing the whole page will reset to the report's intended configuration.

Project workflow

Follow along in this Colab →

  1. Upload raw data
  2. Create a balanced split (train/val/test)
  3. Train model and validate predictions
  4. Run inference & explore results
We’ll start by uploading our raw data, then split that data into train, validation, and test sets before spending the bulk of our time training our model, validating predictions, running inference, and exploring our results. In Artifacts, we can see the connections between our datasets, predictions, and models. Our workflow looks like this:

1. Upload raw data

Step 1 in this Colab →
The full iNaturalist dataset contains over 675,000 images. For this project, I work with a more manageable subset of 12,000 images as my "raw" data, from which I take further subsets for train/validation/test splits. Each data subset is organized into 10 subfolders. The name of each subfolder is the ground truth label for the images it contains (Amphibia, Animalia, Arachnida...Reptilia). With Artifacts, I can upload my dataset and automatically track and version all the different ways I may subsequently decide to generate my train/val/test splits (how many items per split or per class, balanced or unbalanced, held-out test set or not, etc).
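If you want to adapt this step outside the Colab, a minimal sketch of the upload might look like the following (the project name, artifact name, and folder path here are placeholders, not necessarily the ones used in the Colab):

```python
import wandb

# Start a run whose only job is to version the raw data
run = wandb.init(project="inaturalist-tables", job_type="upload")

# One subfolder per ground-truth label (Amphibia, Animalia, ..., Reptilia)
raw_data = wandb.Artifact("inat_raw_12k", type="raw_data")
raw_data.add_dir("raw_images/")

run.log_artifact(raw_data)
run.finish()
```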

2. Create a train/val/test split

Step 2 in this Colab →
Starting from the raw dataset, we create several train/validation/test splits (80%/10%/10% each time), beginning with a tiny 100-image dataset as a proof of concept and building up to more meaningful sizes. Each time, we can use Artifacts to snapshot the contents of each split and Tables to visualize the details: confirm the distribution across true labels (whether we're training on balanced or imbalanced data), view the sizes of the splits, etc.
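A rough sketch of this step, assuming the folder layout above and placeholder artifact names: shuffle each label's files, assign 80/10/10 to train/val/test, and log both an Artifact snapshot and a Table of (image, label, split) rows.

```python
import os
import random

import wandb

run = wandb.init(project="inaturalist-tables", job_type="split_data")
raw_dir = run.use_artifact("inat_raw_12k:latest").download()

split_artifact = wandb.Artifact("inat_split_5k", type="balanced_data")
split_table = wandb.Table(columns=["image", "label", "split"])

for label in sorted(os.listdir(raw_dir)):
    files = sorted(os.listdir(os.path.join(raw_dir, label)))
    random.shuffle(files)
    n = len(files)
    splits = {
        "train": files[: int(0.8 * n)],
        "val": files[int(0.8 * n) : int(0.9 * n)],
        "test": files[int(0.9 * n) :],
    }
    for split_name, split_files in splits.items():
        for fname in split_files:
            path = os.path.join(raw_dir, label, fname)
            split_table.add_data(wandb.Image(path), label, split_name)

# Snapshot the split assignment and the Table together in one artifact version
split_artifact.add(split_table, "data_split")
run.log_artifact(split_artifact)
run.finish()
```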

Verify data distribution: Group by "split"

Now, we'll confirm the data distribution across labels for each split. Here I have 400 images for each label in train, 50 in val, and 50 in test (scroll down in each cell in the "label" column to confirm).
Explore the dataset interactively →

Preview the images: Group by "label"

Next, we'll group by "label" to see all the images by their true class.

Here's the Colab once more →

3. Train models and validate their predictions

Now we're ready to train some models! As a quick and simple example, we use a pre-trained Inception-V3 network and fine-tune it with a fully-connected layer of variable size. Follow along with a simple example in Step 3 of this Colab notebook →
After every epoch of training, we log the predictions on the validation dataset to a Table. We can experiment with different hyperparameters—the size of the tuning layer, the learning rate, the number of epochs, etc—and compare the validation predictions across model variants. First, we'll look at a few possible ways to analyze one model's performance: the recall and precision across classes, focusing on the most confusing classes, and examining the hard negatives within those. We'll use a few different models and training regimes across these sections, specified in the run set (gray tab below each panel).
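For reference, here is a rough sketch of this setup: an InceptionV3 base with a variable-size fully-connected head, plus a Keras callback that logs validation predictions to a Table after every epoch. The class list, variable names, and data handling below are placeholders; the Colab has the exact code.

```python
import numpy as np
import tensorflow as tf
import wandb

CLASSES = ["Amphibia", "Animalia", "Arachnida", "Aves", "Fungi",
           "Insecta", "Mammalia", "Mollusca", "Plantae", "Reptilia"]

def build_model(fc_size=1024, num_classes=len(CLASSES)):
    """InceptionV3 base with a variable-size fully-connected fine-tuning layer."""
    base = tf.keras.applications.InceptionV3(
        include_top=False, weights="imagenet", pooling="avg")
    base.trainable = False
    x = tf.keras.layers.Dense(fc_size, activation="relu")(base.output)
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(x)
    model = tf.keras.Model(base.input, outputs)
    model.compile(optimizer="rmsprop",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

class ValPredictionLogger(tf.keras.callbacks.Callback):
    """Log validation predictions to a wandb.Table after every epoch."""
    def __init__(self, val_images, val_labels):
        super().__init__()
        self.val_images = val_images   # array of validation images
        self.val_labels = val_labels   # list of true class names

    def on_epoch_end(self, epoch, logs=None):
        scores = self.model.predict(self.val_images)
        columns = ["image", "truth", "guess"] + [f"score_{c}" for c in CLASSES]
        table = wandb.Table(columns=columns)
        for img, truth, row in zip(self.val_images, self.val_labels, scores):
            guess = CLASSES[int(np.argmax(row))]
            table.add_data(wandb.Image(img), truth, guess, *row.tolist())
        wandb.log({"val_predictions": table})
```

Pass the callback to training as usual, e.g. model.fit(train_ds, epochs=5, callbacks=[ValPredictionLogger(val_images, val_labels)]).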

Check model recall or false negatives: Group by "truth"

Next, let's see the distribution of predictions and confidence scores for a given correct class. In the Table below, you can scroll vertically through all 10 classes, page through the images in each row using arrows, and scroll right to see distributions for the remaining classes. Some interesting patterns emerge in this example as we look at the histograms of predictions for a given true class (the "guess" column) and the corresponding images from the validation set:
To recreate this view from a default Table, group by the "truth" column. You can optionally sort by truth for a stable row ordering (alphabetical by class label).

Check model precision or false positives: Group by "guess"

When the model guesses a particular class, what is the distribution of true labels for those guesses? In this variant, we again see that "Mollusks" are a popular confound for "Animals" (second row, "truth" column). Interestingly, they're also the top confound for "Fungi". Scrolling through some of the images, perhaps snail shells on brown backgrounds or the bright colors of sea slugs against a dark sea are easily confused for mushrooms?
To recreate this view, group by (and optionally sort by) "guess".

Focus on a subset of classes: Filter by true label

Let's look at just the animals, insects, and mollusks. Crustaceans and slugs are especially confusing in this dataset because of the visual context: the model may be picking up on common backgrounds (underwater, tide pools, grass) or hands (frequently holding the smaller creatures).
Different model variants will yield a range of averages for the incorrect label (which is more representative if you additionally filter down to the model's mistakes, row["truth"] != row["guess"]). You can explore this relationship by toggling individual models on/off in the runset below the panels. The runs listed under "Model variants by finetuning layer size" are experiments setting the fully-connected finetuning layer of the model to different sizes: 256, 512, 1024, or 2048. Everything else about the training regime stays constant: InceptionV3 base, 4K training and 500 validation images, 5 epochs, etc—which you can confirm by expanding the run table and scrolling right. From an initial analysis across variants, animals tend to be miscategorized as mollusks more often than mollusks as animals (slightly higher score_Mollusca.avg for Animalia than score_Animalia.avg for Mollusca). One outlying variant (the peach run, "iv3 fc 256 adam") does use Adam instead of rmsprop as the optimizer and seems to make more mistakes on Animalia (more false positives in Mollusca and Arachnida than in the rmsprop equivalent model).

Focus on the confounding images: Filter by "guess"

Let's focus on one more class like Amphibia. You can modify this Table to pick a different class.

Compare results across multiple model versions

The examples so far showcase the predictions of a single model. As you may have discovered by toggling run visibility in an earlier section, you can compare results across two or more runs if they log Tables to the same key. Let's walk through some of the options for model comparison in more detail: the default view joined on id or a specified column, joining Tables by concatenation, aggregating on specific columns, and indexing into models with a side-by-side view.

Default Workspace view shows Tables merged across multiple runs

This panel shows a run trained on 50 images per class (green, 1, bottom bar in each cell) and a run trained on 400 images per class (purple, 0, top bar in each cell). This Table is joined on the unique hash of the image file, letting us compare precisely across any images in the intersection of the validation datasets.
In the default comparison view for multiple runs, each row stacks the values across model variants for evaluation at a glance. For each image, we can see both models' guesses and histograms for the confidence scores for all possible labels. By default, a Table will join on an "id" column if it's available, or on the image file hash otherwise. You can change the join key and merge strategy via the gear icon in the top right corner, including the type of join (inner vs outer). If you change the merge strategy from "joining" to "concatenating", Table rows across runs will simply be appended into one long list.

Focus on specific images: Sort by derived metrics

To focus on how predictions change for individual images, you can add new metrics as columns and sort by their values. For example, in this flat view we can see the candidate "amphibians" on which the model variants disagreed the most, as quantified by the standard deviation in confidence score for Amphibia.

Aggregate model recall: Group by "truth"

When we aggregate by "truth", the differences between the Tiny training scheme (1000 images total, 800 train and 100 val) and the Medium version (5000 images total, 4000 train and 500 val) become more apparent. With the tiny 800-image dataset, the model seems to overfit on Amphibia and guess that class very frequently. The 4000-image dataset leads to more reasonable prediction distributions—the histograms along the diagonal of true class vs confidence score for that class are especially illustrative. Even with more training data, the model continues to misclassify some mollusks as animals (and to a lesser extent as mushrooms; see the Mollusca row). Fungi also become a stronger confound for plants as more data is added, shifting from the more-random Amphibia, Mammalia, and Arachnida classes for the Tiny model.
To configure the view above, group by the "truth" column.

Dynamically query and explore model variants

Tables offer a powerful and flexible way of dynamically querying your data. You can toggle the visibility of 5 different variants below, which differ only in the size of their last fully-connected fine-tuning (FC) layers (and, in one case, the optimizer). The first panel shows prediction distributions for fixed true labels—the smaller the FC size, the more these distributions seem to peak at the correct prediction. The second panel groups by guess and filters out correct answers, letting us focus on the errors, which seem to be fewer with lower FC size—perhaps the baseline size of 1024 was overfitting. The final panel keeps this filter but splits the images into vertical sections by model, showing which specific images each model variant confuses. The green "fc 256" variant is slightly more accurate than the blue "fc 1024" baseline. However, the truth distributions show a more complex picture, with more frequent mistakes and wider distributions of confused classes for Amphibia, Animalia, Arachnida, and Mollusca specifically in the smaller "fc 256" model versus the larger "fc 1024" one. The balance of regularization and overfitting could be tuned further.
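The sweep itself can be as simple as looping over layer sizes, one run per variant, reusing the build_model() and ValPredictionLogger sketched earlier (train_ds, val_images, and val_labels are assumed to be loaded already):

```python
import wandb

# One run per fully-connected layer size; everything else stays constant
for fc_size in [256, 512, 1024, 2048]:
    run = wandb.init(project="inaturalist-tables", job_type="train",
                     name=f"iv3_fc_{fc_size}", config={"fc_size": fc_size},
                     reinit=True)
    model = build_model(fc_size=run.config.fc_size)
    model.fit(train_ds, epochs=5,
              callbacks=[ValPredictionLogger(val_images, val_labels)])
    run.finish()
```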

Focus on confusing classes: Filter by mislabeled images

Try a filter query like row["truth"] = "Plantae" and row["guess"] != "Plantae" for multiple panels side by side. You can select any of the other class names (Reptilia, Fungi, etc) to compare error counts and individual images across models. Select a model in a panel by typing its index after the Table key. This section shows three CNN experiments on the size of the fully-connected fine-tuning (FC) layer, from left to right: FC 512 [0], FC 1024 [2], and FC 2048 [1]. The middle column shows that the FC 1024 version confuses the fewest plants, and halving or doubling the fine-tuning layer leads to a longer tail of errors.
A side-by-side view of hard negative Plantae, after 1 epoch (left) vs 5 epochs of training (right).

Advanced: Artifacts view

View the comparison in the Artifacts context: Left 1 epoch, right 5 epochs
If your workflow depends on versioning a dataset or model and referencing/downloading/using it in future experiment runs, you can also add a Table to an Artifact and visualize comparisons from the Artifacts view. Here I compare two versions of my model, one trained for a single epoch (orange bars) and one trained for five epochs (blue bars). You can see that performance and confidence generally improve with more training.
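The pattern looks roughly like this, assuming a trained Keras model and a validation predictions Table built as in the earlier sketch (names and paths are placeholders):

```python
import wandb

run = wandb.init(project="inaturalist-tables", job_type="train")

# Save the trained weights and attach the predictions Table to the same version,
# so both travel together and can be compared from the Artifacts view
model.save("finetuned_iv3.h5")
model_artifact = wandb.Artifact("iv3_finetuned", type="model")
model_artifact.add_file("finetuned_iv3.h5")
model_artifact.add(val_predictions_table, "val_predictions")

run.log_artifact(model_artifact)
run.finish()
```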
To compare versions from any model artifact view:

4. Run inference and explore results

Follow along in Step 4 of the Colab →
Model variants trained and saved with Artifacts are easy to load and test on specific data splits. Below you can select a few variants to compare with stacked histograms (top panel) or side-by-side across runs (bottom two panels). In many scenarios, we won't have labels for test data, but we do here for illustration purposes. Note that we're now switching to view predictions on test data instead of validation data.
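Sketched with placeholder artifact names (the Colab has the full data-loading code), the inference step looks roughly like this:

```python
import os

import numpy as np
import tensorflow as tf
import wandb

run = wandb.init(project="inaturalist-tables", job_type="inference")

# Pull a specific saved model version and the data split to test it on
model_dir = run.use_artifact("iv3_finetuned:latest").download()
model = tf.keras.models.load_model(os.path.join(model_dir, "finetuned_iv3.h5"))
split_dir = run.use_artifact("inat_split_5k:latest").download()

# test_images / test_labels: loaded from split_dir (omitted here; see the Colab)
columns = ["image", "truth", "guess"] + [f"score_{c}" for c in CLASSES]
test_table = wandb.Table(columns=columns)
for img, truth in zip(test_images, test_labels):
    scores = model.predict(img[np.newaxis, ...])[0]
    guess = CLASSES[int(np.argmax(scores))]
    test_table.add_data(wandb.Image(img), truth, guess, *scores.tolist())

wandb.log({"test_predictions": test_table})
run.finish()
```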

Aggregate by true class or focus on specific hard negatives

The default view shows predictions after 5 epochs of training (left, mint bars) vs 1 epoch (right, magenta bars). The guess and score distributions generally become more peaked in the correct label with more training. The side-by-side vertical sections show which images were misclassified by which model. These are very similar, with the biggest shift over 4 additional epochs of training happening in Mammalia (from 31 errors to only 9) and Fungi (22 to 6).
Changing the size of the fine-tuning layer doesn't seem to have an obvious impact on per-class performance, but you can explore different model variants by toggling the "eye" icons in the "Inference variants" run set below.

Filter to a subset of classes

For simplicity, the analysis below compares the same model variant trained for 1 epoch (left) and for 5 epochs (right) throughout.
With more epochs, there is less confusion among mollusks, animals, and insects (filter query: row["truth"] = "Mollusca" or row["truth"] = "Animalia" or row["truth"] = "Insecta")

See most confused classes: Group by guess

After longer training, the model generally makes fewer mistakes. However, the "Fungi" mistakes on the right (right column, fourth row down) are new and interesting—perhaps this is the effect of the background/overall shape?

Compare across individual images

See how the class predictions for the same images change with more training—generally more confident on the right.

Focus on top confused images for a particular (truth, guess) pair

Optionally filter by labels, then sort by score.

Interesting finds

Context is everything

Here are some Plantae from the dataset which the models (L: 1 epoch, R: 5 epochs) failed to identify as plants. It looks like the background/visual context of the living thing might influence the prediction. In the last row on the left, the photos of field/forest scenes are more canonical for images of mammals. In the first row on the right, the bare earth is more typical context for Fungi photos. And the pitcher plant in the bottom right, guessed as a reptile, definitely fooled me.

False positives as evolutionary advantage

The "eyespot" patterns on the butterfly in the third row image look amazingly like the face of a reptile or snake (you may need to zoom in on this page to see the nostril spots). There's some evidence that such coloration evolved to discourage predators. Note that the confidence scores for the last image are almost evenly split across Insecta, Reptilia, Amphibia, and even Arachnida (a bit lower for this last one). If we wanted to ship this model to production, adding a minimum score threshold for detecting any particular class would filter out particularly confusing images like this one.

Add your own?

If you find any interesting insights or patterns in our interactive example, please comment below—we'd love to see them!