Tables Tutorial: Visualize Data for Image Classification
How to version and interactively explore data and predictions across train/val/test with W&B's new Tables feature
Introduction
This is a walkthrough of dataset and prediction visualization using Tables and Artifacts for image classification on W&B. Specifically, we'll finetune a convnet in Keras on photos from iNaturalist 2017 to identify 10 classes of living things (plants, insects, birds, etc). This is a tiny glimpse of how Tables can facilitate deep exploration and understanding of your data, models, predictions, and experiment progress, and you can find a variety of other examples here.
Baseline IV3, 10K train/1K val, 5 epochs
Interact with any Table
All the Tables in this report are fully interactive for you to explore (detailed instructions here). Filter the rows using the expression editor in the top left. Sort and group by the values in a column: hover over the header, and click on the three-dot menu on the right to select an action. The "Reset & Automate Columns" button will return each Table to its default state as originally logged, and refreshing the whole page will reset to the report's intended configuration.
Project workflow
Follow along in this colab →
- Upload raw data
- Create a balanced split (train/val/test)
- Train model and validate predictions
- Run inference & explore results
We’ll start by uploading our raw data, then split that data into train, validation, and test sets before spending the bulk of our time training our model, validating predictions, running inference, and exploring our results. In Artifacts, we can see the connections between our datasets, predictions, and models. Our workflow looks like this:

1. Upload raw data
The full iNaturalist dataset contains over 675,000 images. For this project, I work with a more manageable subset of 12,000 images as my "raw" data, from which I take further subsets for train/validation/testing splits. Each data subset is organized into 10 subfolders. The name of each subfolder is the ground truth label for the images it contains (Amphibia, Animalia, Arachnida...Reptilia). With Artifacts, I can upload my dataset and automatically track and version all the different ways I may subsequently decide to generate my train/val/test splits (how many items per split or per class, balanced or unbalanced, hold out test set or no, etc).
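If you'd like to reproduce this step outside the Colab, a minimal sketch of the upload might look like the following (the project and artifact names here are illustrative placeholders, not the exact ones from the notebook):

```python
import wandb

run = wandb.init(project="nature-image-classification", job_type="upload")

# The folder names (Amphibia, Animalia, ..., Reptilia) double as ground truth labels
raw_data = wandb.Artifact("inat_raw_12K", type="raw_data")
raw_data.add_dir("inaturalist_12K")
run.log_artifact(raw_data)
run.finish()
```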

2. Create a train/val/test split
Starting from the raw dataset, we create several training/validation/test splits (80%/10%/10% each time), starting with a tiny 100-image dataset as a proof of concept and building up to more meaningful sizes. Each time, we can use Artifacts to snapshot the contents of each split and Tables to visualize the details: confirm the distribution across true labels (whether we're training on balanced or imbalanced data), view the sizes of the splits, etc.
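A rough sketch of snapshotting one split, assuming it already lives on disk as train/val/test folders of labeled subfolders (the paths, artifact names, and sizes below are placeholders):

```python
import os
import wandb

run = wandb.init(project="nature-image-classification", job_type="data_split")

split_art = wandb.Artifact("inat_split_5K", type="balanced_data",
                           metadata={"train": 4000, "val": 500, "test": 500})
preview = wandb.Table(columns=["image", "label", "split"])

for split in ["train", "val", "test"]:
    split_dir = os.path.join("inat_5K", split)
    split_art.add_dir(split_dir, name=split)   # snapshot the files themselves
    for label in sorted(os.listdir(split_dir)):
        for fname in os.listdir(os.path.join(split_dir, label)):
            img_path = os.path.join(split_dir, label, fname)
            preview.add_data(wandb.Image(img_path), label, split)

split_art.add(preview, "data_split")           # browse this Table from the artifact view
run.log_artifact(split_art)
run.finish()
```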
Verify data distribution: Group by "split"
Now, we'll confirm the data distribution across labels for each split. Here I have 400 images for each label in train, 50 in val, and 50 in test (scroll down in each cell in the "label" column to confirm).

Preview the images: Group by "label"
Next, we'll group by "label" to see all the images by their true class.

Here's the Colab once more →
3. Train models and validate their predictions
Now we're ready to train some models! As a quick and simple example, we use a pre-trained Inception-V3 network and fine-tune it with a fully-connected layer of variable size. Follow along with a simple example in Step 3 of this Colab notebook →
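For reference, the sketch below is one plausible Keras version of this architecture; the exact preprocessing and hyperparameters live in the Colab:

```python
from tensorflow.keras.applications import InceptionV3
from tensorflow.keras import layers, models

NUM_CLASSES = 10
FC_SIZE = 1024  # fully-connected fine-tuning layer size, varied across experiments

# Load ImageNet weights without the classification head and freeze the backbone
base = InceptionV3(weights="imagenet", include_top=False, input_shape=(299, 299, 3))
base.trainable = False

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(FC_SIZE, activation="relu"),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="rmsprop",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```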
After every epoch of training, we log the predictions on the validation dataset to a Table. We can experiment with different hyperparameters—the size of the tuning layer, the learning rate, the number of epochs, etc—and compare the validation predictions across model variants. First, we'll look at a few possible ways to analyze one model's performance: the recall and precision across classes, focusing on the most confusing classes, and examining the hard negatives within those. We'll use a few different models and training regimes across these sections, specified in the run set (gray tab below each panel).
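One way to log these per-epoch prediction Tables is with a small Keras callback along these lines (variable names and the column layout are illustrative; the Colab is the reference implementation):

```python
import numpy as np
import wandb
from tensorflow.keras.callbacks import Callback

class ValLogCallback(Callback):
    """Log a Table of validation predictions at the end of every epoch."""
    def __init__(self, val_images, val_labels, class_names):
        super().__init__()
        self.val_images = val_images    # preprocessed image array
        self.val_labels = val_labels    # integer label ids
        self.class_names = class_names  # ["Amphibia", "Animalia", ...]

    def on_epoch_end(self, epoch, logs=None):
        scores = self.model.predict(self.val_images)   # softmax scores, shape (n, 10)
        guesses = np.argmax(scores, axis=1)
        columns = ["image", "guess", "truth"] + ["score_" + c for c in self.class_names]
        table = wandb.Table(columns=columns)
        for img, guess, truth, row in zip(self.val_images, guesses,
                                          self.val_labels, scores):
            table.add_data(wandb.Image(img), self.class_names[guess],
                           self.class_names[truth], *row.tolist())
        wandb.log({"val_predictions": table}, step=epoch)
```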
Check model recall or false negatives: Group by "truth"
Next, let's see the distribution of predictions and confidence scores for a given correct class. In the Table below, you can scroll vertically through all 10 classes, page through the images in each row using arrows, and scroll right to see distributions for the remaining classes. Some interesting patterns emerge in this example as we look at the histograms of predictions for a given true class (the "guess" column) and the corresponding images from the validation set:
- In the second row, many of the Animalia are confused for Mollusca. We can see several sea creatures in these images—sea urchins, crabs, starfish, even anemones and jellyfish if you scroll. These are not mollusks according to biological nomenclature, but they live and are photographed in very similar environments. Also, it is an artifact of this dataset that Animalia and Mollusca are two distinct labels, even though "animals" is a higher-level category which includes mollusks. Here, Animalia apparently includes many sea creatures which are not mollusks. Animalia is also the most frequent wrong guess for mollusks (scroll to Mollusca and look at the guess distribution).
- Insecta are the strongest confound for Arachnida. This is a common confusion for humans as well: it's easy to think of small, many-legged, crawling spiders as insects rather than, technically, arachnids.
- Plantae are the most frequent wrong guess for Insecta, and also for Fungi. Based on the images in the Insecta row, this could be because insects are frequently captured on/in/next to plants, especially butterflies on flowers.
- Some Reptilia are mistaken for Amphibia—this again is a tricky distinction for humans (is this a newt, living partially in water, or a lizard?)
To recreate this view from a default Table, group by the "truth" column. You can optionally sort by truth for a stable row ordering (alphabetical by class label).
Baseline IV3, 10K train/1K val, 5 epochs
Check model precision or false positives: Group by "guess"
When the model guesses a particular class, what is the distribution of true labels for those guesses? In this variant, we again see that "Mollusks" are a popular confound for "Animals" (second row, "truth" column). Interestingly, they're also the top confound for "Fungi". Scrolling through some of the images, perhaps snail shells on brown backgrounds or the bright colors of sea slugs against a dark sea are easily confused for mushrooms?
To recreate this view, group by (and optionally sort by) "guess".
Baseline IV3, 4K train/500 val, 5 epochs
Focus on a subset of classes: Filter by true label
Let's look at just the animals, insects, and mollusks. Crustaceans and slugs are especially confusing in this dataset because of the visual context: the model may be picking up on common backgrounds (underwater, tide pools, grass) or hands (frequently holding the smaller creatures).
- Click "Filter" to enter a query selecting the true labels of interest. Type and choose hints in the dropdown to formulate an expression like row["truth"] = "Mollusca" or row["truth"] = "Animalia" or row["truth"] = "Insecta"
- Sort and group by "truth".
- Edit the first three score columns via the header menu to rearrange them and prioritize the score columns for the classes you've just selected: Animalia, Insecta, and Mollusca. You can also add columns to view the average confidence score for a label combination across models (e.g. score_Animalia.avg).
Different model variants will yield a range of averages for the incorrect label (which is more representative if you additionally filter down to the model's mistakes, row["truth"] != row["guess"]). You can explore this relationship by toggling individual models on/off in the run set below the panels. The runs listed under "Model variants by finetuning layer size" are experiments setting the fully-connected finetuning layer of the model to different sizes: 256, 512, 1024, or 2048. Everything else about the training regime stays constant: InceptionV3 base, 4K training and 500 validation images, 5 epochs, etc., which you can confirm by expanding the run table and scrolling right. From an initial analysis across variants, animals tend to be miscategorized as mollusks more often than mollusks as animals (slightly higher score_Mollusca.avg for Animalia than score_Animalia.avg for Mollusca). One outlying variant (the peach run, "iv3 fc 256 adam") uses Adam instead of rmsprop as the optimizer and seems to make more mistakes on Animalia (more false positives in Mollusca and Arachnida than in the equivalent rmsprop model).
Model variants by finetuning layer size
Focus on the confounding images: Filter by "guess"
Let's focus on one more class like Amphibia. You can modify this Table to pick a different class.
- Filter for the misclassified images (here, Amphibians) with a query like row["truth"] = "Amphibia" and row["guess"] != "Amphibia"
- Sort by the corresponding confidence score (here, score_Amphibia) in descending order to show the closest guesses: where was the model most confident that the creature was an Amphibian, but ultimately wrong? (left panel below). You can also try the reverse sorting to see the biggest mistakes: where was the model most confident that the creature was NOT an Amphibian (alternatively, most confident that the photo was of some other class)?
- Group by guess to try to see systematic errors per class (right panel below). Mollusca is the class this model incorrectly guesses most often for Amphibia, including an interesting perspective on a frog that looks exactly like a clam shell.
Baseline IV3, 10K train/1K val, 5 epochs
Compare results across multiple model versions
The examples so far showcase the predictions of a single model. As you may have discovered by toggling run visibility in an earlier section, you can compare results across two or more runs if they log Tables to the same key. Let's walk through some of the options for model comparison in more detail: the default view joined on id or a specified column, joining Tables by concatenation, aggregating on specific columns, and indexing into models with a side-by-side view.
Default Workspace view shows Tables merged across multiple runs
This panel shows two runs: one trained on 50 images per class (green, 1, bottom bar in each cell) and one trained on 400 images per class (purple, 0, top bar in each cell). This Table is joined on the unique hash of the image file, letting us compare precisely across any images in the intersection of the validation data sets.
In the default comparison view for multiple runs, each row stacks the values across model variants for evaluation at-a-glance. For each image, we can see both models' guesses and histograms for the confidence scores for all possible labels. By default, a Table will join on an "id" column if it's available, or on the image file hash otherwise. You can change the join key and merge strategy via the gear icon in the top right corner, including the type of join (inner vs outer). If you change the merge strategy from "joining" to "concatenating", Table rows across runs will simply be appended into one long list.
Tiny and Medium IV3 variants
Focus on specific images: Sort by derived metrics
To focus on how predictions change for individual images, you can add new metrics as columns and sort by their values. For example, in this flat view we can see the candidate "amphibians" on which the model variants disagreed the most, as quantified by the standard deviation in confidence score for Amphibia.
Tiny and Medium IV3 variants
Aggregate model recall: Group by "truth"
When we aggregate by "truth", the differences between the Tiny training scheme (1000 images total, 800 train and 100 val) and the Medium version (5000 images total, 4000 train and 500 val) become more apparent. With the tiny 800-image dataset, the model seems to overfit on Amphibia and guess that class very frequently. The 4000-image dataset leads to more reasonable prediction distributions—the histograms along the diagonal of true class vs confidence score for that class are especially illustrative. Even with more training data, the model continues to misclassify some mollusks as animals (and to a lesser extent as mushrooms; see the Mollusca row). Fungi also become a stronger confound for plants as more data is added, shifting from the more-random Amphibia, Mammalia, and Arachnida classes for the Tiny model.
Tiny and Medium IV3 variants
To configure the view above:
- In the run set, mark visible only the runs you wish to compare.
- Join on image: in the top right corner, click the gear icon and modify the join key to "image".
- To show truth as a single label: from the Column settings for "truth", change the cell expression from row["truth"] to row["truth"][0] to view and sort individual labels instead of arrays (since the truth is identical across model variants).
- Optionally sort by truth.
Dynamically query and explore model variants
Tables offer a powerful and flexible way of dynamically querying your data. You can toggle the visibility of 5 different variants below, which differ only in the size of their last fully-connected fine-tuning (FC) layer (and in one case the optimizer). The first panel shows prediction distributions for fixed true labels: the smaller the FC size, the more these distributions seem to peak at the correct prediction. The second panel groups by guess and filters out correct answers, letting us focus on the errors, which seem to be fewer with lower FC size (perhaps the baseline size of 1024 was overfitting). The final panel keeps this filter but splits the images into vertical sections by model, showing which specific images each model variant confuses. The green "fc 256" variant is slightly more accurate than the blue "fc 1024" baseline. However, the truth distributions show a more complex picture, with more frequent mistakes and wider distributions of confused classes for Amphibia, Animalia, Arachnida, and Mollusca specifically in the smaller "fc 256" model versus the larger "fc 1024" one. The balance of regularization and overfitting could be tuned further.
Model variants by FC size
Focus on confusing classes: Filter by mislabeled images
Try a filter query like row["truth"] = "Plantae" and row["guess"] != "Plantae" for multiple panels side by side. You can select any of the other class names (Reptilia, Fungi, etc.) to compare error counts and individual images across models. Select a model in a panel by typing its index after the Table key. This section shows three CNN experiments on the size of the fully-connected fine-tuning (FC) layer, from left to right: FC 512 [0], FC 1024 [2], and FC 2048 [1]. The middle column shows that the FC 1024 version confuses the fewest plants, and that halving or doubling the fine-tuning layer leads to a longer tail of errors.
CNN experiments, 10K train / 1K val, 5 epochs

A side-by-side view of hard negative Plantae, after 1 epoch (left) vs 5 epochs of training (right).
Advanced: Artifacts view
If your workflow depends on versioning a dataset or model and referencing/downloading/using it in future experiment runs, you can also add a Table to an Artifact and visualize comparisons from the Artifacts view. Here I compare two versions of my model, one trained for a single epoch (orange bars) and one trained for five epochs (blue bars). You can see that performance and confidence generally improve with more training.
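Attaching a Table to a model artifact only takes a couple of extra lines. The sketch below assumes a live run, a saved Keras model file, and a predictions Table (val_table) like the one built during training; the names are placeholders:

```python
import wandb

model_art = wandb.Artifact("iv3_finetuned", type="model")
model_art.add_file("iv3_finetuned.h5")         # the saved model weights
model_art.add(val_table, "val_predictions")    # the Table renders in the artifact view
wandb.run.log_artifact(model_art)
```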
To compare versions from any model artifact view:
- Using the left sidebar, select one version of the model, then hover over another version and click "Compare" to select it.
- Sort by "truth", group by "truth", and otherwise use the Table functionality as above—now you can compare the prediction distributions across the two models.

4. Run inference and explore results
Model variants trained and saved with Artifacts are easy to load and test on specific data splits. Below you can select a few variants to compare with stacked histograms (top panel) or side-by-side across runs (bottom two panels). In many scenarios, we won't have labels for test data, but we do here for illustration purposes. Note that we're now switching to view predictions on test data instead of validation data.
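Loading a versioned model for inference might look roughly like this (artifact names, aliases, and file names are placeholders):

```python
import os
import wandb
from tensorflow.keras.models import load_model

run = wandb.init(project="nature-image-classification", job_type="inference")

# Pull down a specific saved model version and the test split
model_art = run.use_artifact("iv3_finetuned:v1")
model_dir = model_art.download()
model = load_model(os.path.join(model_dir, "iv3_finetuned.h5"))

test_art = run.use_artifact("inat_split_5K:latest")
test_dir = test_art.download()
# ...load and preprocess images from test_dir, predict, and log a Table as during training
run.finish()
```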
Aggregate by true class or focus on specific hard negatives
The default view shows predictions after 5 epochs of training (left, mint bars) vs 1 epoch (right, magenta bars). The guess and score distributions generally become more peaked at the correct label with more training. The side-by-side vertical sections show which images were misclassified by which model. These are very similar, with the biggest shift over 4 additional epochs of training happening in Mammalia (from 31 errors to only 9) and Fungi (22 to 6).
Changing the size of the fine-tuning layer doesn't seem to have an obvious impact on per-class performance, but you can explore different model variants by toggling the "eye" icons in the "Inference variants" run set below.
Inference variants
Filter to a subset of classes
For simplicity, the analysis below keeps a model variant trained for 1 epoch (left) and 5 epochs (right) throughout.
With more epochs, there is less confusion among mollusks, animals, and insects (filter query: row["truth"] = "Mollusca" or row["truth"] = "Animalia" or row["truth"] = "Insecta")

See most confused classes: Group by guess
After longer training, the model generally makes fewer mistakes. However, the "Fungi" mistakes on the right (right column, fourth row down) are new and interesting; perhaps this is an effect of the background or overall shape?

Compare across individual images
See how the class predictions for the same images change with more training—generally more confident on the right.

Focus on top confused images for a particular (truth, guess) pair
Optionally filter by labels, then sort by score.

Interesting finds
Context is everything
Here are some Plantae from the dataset which the models (L: 1 epoch, R: 5 epochs) failed to identify as plants. It looks like the background/visual context of the living thing might influence the prediction. In the last row on the left, the photos of field/forest scenes are more canonical for images of mammals. In the first row on the right, the bare earth is more typical context for Fungi photos. And the pitcher plant in the bottom right, guessed as a reptile, definitely fooled me.

False positives as evolutionary advantage
The "eyespot" patterns on the butterfly in the third row image look amazingly like the face of a reptile or snake (you may need to zoom in on this page to see the nostril spots). There's some evidence that such coloration evolved to discourage predators. Note that the confidence scores for the last image are almost evenly split across Insecta, Reptilia, Amphibia, and even Arachnida (a bit lower for this last one). If we wanted to ship this model to production, adding a minimum score threshold for detecting any particular class would filter out particularly confusing images like this one.

Add your own?
If you find any interesting insights or patterns in our interactive example, please comment below—we'd love to see them!