Better Paths Through Idea Space

Stacey Svetlichnaya

Guided deep learning exploration with Weights & Biases

Deep learning papers often present coherent principles leading to elegant hypotheses, a clean experimental design, and eventually convincing results. In practice, progress in deep learning is much more chaotic and relies mostly on fast iteration and tight feedback loops. I worked on large-scale object recognition and image classification at Flickr for four years and recently started at Weights & Biases, which makes wandb—a developer toolkit for deep learning that fits my preferred workflow quite well. When you’re in exploratory hacking mode, wandb helps track your meandering path through experimental space and even direct it to more promising areas. In this post, I will walk through a real example of building and optimizing a computer vision model on a new dataset, including how to instrument a simple Keras model with wandb and some practices that accelerated my progress.

The task: identify plants & animals with deep learning

For my first wandb project, I tried this Keras fine-tuning tutorial on iNaturalist 2017, which contains >675K photos of >5K different species of life, with each species assigned to one of 13 classes (strictly speaking, one of 13 natural taxa).

Different taxa: Insects, mammals, mushrooms
Spot 10 differences? 3 distinct species of wren

I started with the 5-layer convnet from the Keras tutorial: 3 convolutional layers and 2 fully-connected layers with ReLU and max pooling. Since the full iNaturalist 2017 dataset is 186GB and heavily skewed, I generated a more manageable balanced subset of 50,000 images across the 10 most frequent taxa [1]. After confirming the net learns (or at least, overfits a binary classifier on 2000 examples), I fed in a balanced 10% of the data (5000 train, 800 validation) and fiddled with the batch size, number of layers, the size of the last fully-connected layer, and the dropout.
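The balanced subset described above can be sketched roughly as follows. This is a minimal illustration and not the tooling mentioned in footnote [1]: the function names `balanced_sample` and `link_subset`, the `files_by_taxon` mapping, and the one-folder-per-taxon output layout are all assumptions made for the example.

```python
import os
import random


def balanced_sample(files_by_taxon, per_taxon, seed=0):
    """Pick the same number of files for every taxon.

    files_by_taxon: dict mapping a taxon name to a list of image paths
    (a hypothetical structure). A taxon with fewer than `per_taxon` files
    raises an error instead of silently unbalancing the subset.
    """
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    subset = {}
    for taxon, files in sorted(files_by_taxon.items()):
        if len(files) < per_taxon:
            raise ValueError(f"{taxon} has only {len(files)} files")
        subset[taxon] = rng.sample(files, per_taxon)
    return subset


def link_subset(subset, out_dir):
    """Materialize the subset as symlinks: out_dir/<taxon>/<filename>."""
    for taxon, files in subset.items():
        taxon_dir = os.path.join(out_dir, taxon)
        os.makedirs(taxon_dir, exist_ok=True)
        for path in files:
            link = os.path.join(taxon_dir, os.path.basename(path))
            if not os.path.lexists(link):  # skip links that already exist
                os.symlink(os.path.abspath(path), link)
```

Under these assumptions, calling `balanced_sample` with `per_taxon=5000` over 10 taxa would correspond to the 50,000-image set, and symlinks keep the subset small on disk because no image is copied.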

The tiny network and dataset let me move quickly and get a sense of the tool’s capabilities. The best model for predicting one of 10 taxa attained a train/val accuracy of 44.2%/45.3%, which increased to 51.6%/48.3% on the full 50K dataset (the wider gap suggests some overfitting). This is roughly 5X better than random guessing (10% for 10 balanced classes) with only 7 layers. For comparison, fine-tuning an ImageNet-pre-trained InceptionV3 net on the 50K dataset raises the accuracy to 95.6% train and 84.5% validation (clear overfitting, as expected; InceptionV3 gets to 52.9% out of the box, in large part because these nature photos are very similar to ImageNet).

Map your steps: all the curves in 5 lines of code

Adding wandb to a project once you create an account is easy: two commands to install wandb and authenticate with your API key, three lines to link your script to a web-viewable wandb project, and an instance of WandbCallback passed to your training method to handle logging. With Keras, wandb automatically tracks the necessities for you: epoch-level accuracy and loss for both training and validation. You can launch your script with these minimal changes, navigate to the presented URL, and rejoice in the emerging plot lines. These default charts make it immediately obvious whether the model is learning. They’re also much easier to reason about and compare across runs than the rows of floats I’m usually watching in Keras, and there is no need to write separate metrics-computing or chart-plotting code.

W&B default graphs for Keras: training and validation accuracy and loss
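As a rough sketch, the instrumentation described above might look like the following. The install and login commands appear as comments. The helper name `fit_with_wandb` and the project name "keras-cnn-nature" are placeholders chosen for this example, and the import path of WandbCallback depends on the installed wandb version, so both known locations are tried.

```python
# One-time setup in the shell:
#   pip install wandb
#   wandb login


def fit_with_wandb(model, x_train, y_train, project="keras-cnn-nature", **fit_kwargs):
    """Train a compiled Keras model while logging the run to wandb."""
    # Imported here so the script can still be loaded without wandb installed.
    import wandb
    try:
        from wandb.integration.keras import WandbCallback  # newer releases
    except ImportError:
        from wandb.keras import WandbCallback  # older releases

    wandb.init(project=project)  # links this script to a web-viewable project

    # Keep any callbacks the caller already passed and append the wandb one.
    callbacks = list(fit_kwargs.pop("callbacks", None) or []) + [WandbCallback()]
    return model.fit(x_train, y_train, callbacks=callbacks, **fit_kwargs)
```

Replacing a direct `model.fit(...)` call with something like `fit_with_wandb(model, x, y, validation_data=(x_val, y_val), epochs=10)` should then produce the default training and validation charts, assuming you are logged in.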

Find clues in your surroundings: focus on the relevant metrics

My survey of hyperparameters for the small CNN was far from principled. Trying to simulate an authentic hacking experience, I changed values and added layers directly in my script and hardcoded different versions on various machines (local CPU, localish GPU, remote GPU). To my undeserved delight, a lot of this information (how many layers, of what size) is recoverable from the “Model” tab for a given run, which automatically shows the name, type (Conv2D, Activation, MaxPooling2D,…), number of parameters (no more mental math!), and output shape for every element in the computational graph. As I discovered network parameters of greater interest, I stored them explicitly as experiment-level settings in wandb.config, a dictionary for user-defined values such as batch size, number of training examples, optimizer type, or anything else expected to stay fixed in a single experiment. A single experiment here means a single call of your training script, which is what wandb defines as a run. The default summary metrics (train/val acc/loss) and the set of fields specified in wandb.config become the columns stored for a given run. Columns can be used to sort, group, and filter the runs displayed or graphed, and they are automagically hidden when their values are constant across a selection of runs. This let me narrow my attention window and simplified the search for that one weird run where the loss blew up or where switching optimizers (to rmsprop) suddenly made everything better.

Showing the relevant config and summary metrics for a subset of runs
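To illustrate why hiding constant columns helps, here is a small sketch of the underlying idea in plain Python. It is not wandb’s implementation: the function `varying_columns` and the example run dictionaries are invented for illustration, with each run represented by the kind of dictionary you might store in wandb.config.

```python
def varying_columns(runs):
    """Return the config keys whose values differ across the given runs.

    runs: list of dicts, one per run. Keys with the same value in every run
    carry no information for comparison and can be hidden.
    """
    keys = sorted({key for run in runs for key in run})
    missing = object()  # sentinel so an absent key counts as a distinct value
    varying = []
    for key in keys:
        values = [run.get(key, missing) for run in runs]
        if any(value != values[0] for value in values[1:]):
            varying.append(key)
    return varying


runs = [
    {"optimizer": "sgd", "batch_size": 32, "dropout": 0.2},
    {"optimizer": "rmsprop", "batch_size": 32, "dropout": 0.2},
    {"optimizer": "rmsprop", "batch_size": 32, "dropout": 0.4},
]
print(varying_columns(runs))  # ['dropout', 'optimizer']
```

Here the batch size is identical in all three runs, so only the dropout and optimizer columns are worth looking at when comparing them.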

Mark promising directions: good handles reduce cognitive load

This iterative loop (find a pattern in some of the runs tried so far; configure the next run to confirm, break, or extend it) still left me holding a lot of temporary values in my short-term memory (the name of the relevant run, the relevant column, the right color line on the right chart…). Giving my runs descriptive names cut down this mental list, especially keeping a common prefix for runs in a series and emphasizing the test variables and values (e.g., from my base network, varying only the dropout d and the batch size b yields run names like “base_d_0.2_b_32”, “base_d_0.4_b_64”, “base_d_0.6_b_128”). Adding command-line arguments via argparse for the run name and all the other parameters of interest (which I can pass directly to wandb.config for logging) made my experimentation much more efficient, letting me modify all relevant parameters directly in the launch command. The --quiet_mode or dry run flag is particularly useful when debugging a training script (to reduce noise in the UI).
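A minimal sketch of this naming and argparse setup could look like the following. The flag names other than the quiet-mode flag, the helper names, and the project name are assumptions for illustration. How the original script implements its quiet mode is not shown in this post, so here it simply skips wandb altogether.

```python
import argparse


def build_parser():
    """Command-line flags for the parameters varied between runs."""
    parser = argparse.ArgumentParser(description="Train a small CNN (sketch)")
    parser.add_argument("--prefix", default="base", help="shared name for a series of runs")
    parser.add_argument("--dropout", type=float, default=0.2)
    parser.add_argument("--batch_size", type=int, default=32)
    parser.add_argument("--quiet_mode", action="store_true",
                        help="debugging runs: skip wandb logging in this sketch")
    return parser


def run_name(args):
    """Encode the test variables in the name, e.g. base_d_0.2_b_32."""
    return f"{args.prefix}_d_{args.dropout}_b_{args.batch_size}"


def start_run(args, project="keras-cnn-nature"):
    """Start a wandb run named after its settings, unless in quiet mode."""
    if args.quiet_mode:
        return None
    import wandb  # imported lazily so parsing works without wandb installed
    # vars(args) turns the parsed flags into a dict that wandb stores as config.
    return wandb.init(project=project, name=run_name(args), config=vars(args))
```

With a hypothetical script called train.py, launching `python train.py --dropout 0.4 --batch_size 64` would then produce a run named base_d_0.4_b_64, with every flag recorded in the run config.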

Gradually, these runs coalesced into insights and more structured experiments, which I began documenting in a wandb report. This is a highly configurable view of your project via data tables, plots, markdown, HTML, and other visualizations. While the project is evergreen, a report is intended as a snapshot of your progress, so you can conveniently recall the details or share them with others. Notes on the day’s experiments, todos, questions, or insights, with accompanying charts as evidence, can help you track intermediate results and remember what to explore next.

For example, fine-tuning a well-known, high-accuracy convnet pre-trained on ImageNet is a reliable baseline for most vision tasks, but which base model is optimal for this dataset? I compared the loss, train accuracy, and validation accuracy of InceptionV3, InceptionResNetV2, Xception, and ResNet50 on the tiny 5K dataset and found that the Inception variants performed best (and noticed that a Keras warning about ResNet50 more likely pointed to a bug, judging by the low initial validation accuracy). Since InceptionResNetV2 has double the size/parameter count of InceptionV3, I chose the more compact model going forward and clearly summarized my findings for future reference. Now I can narrow my focus by ignoring all the other possible base model variants, yet also be better prepared if I happen to come back to this question. I will know to debug the ResNet50 initialization, I will have links to the exact config I used for each constituent run, and hopefully I won’t have to deeply reanalyze the associated charts (“why am I using InceptionV3 again? maybe if I try a smaller/deeper/different…oh, that’s right, this is why”).

A succinct insight for my future self, with relevant notes, visual evidence, and even exact run config

In another experiment, I tracked the precision for each of the 10 taxa and found that Animalia and Plantae were the hardest to predict (5-10% lower than the other labels). This makes sense because animals and plants are technically biological kingdoms, which are larger/higher-level categories than biological classes: they contain more species and are more visually diverse, hence harder to learn from a fixed number of examples.

Another example: my takeaways from tracking per-class precision
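For readers who want to reproduce this kind of per-class analysis, here is a minimal sketch of computing precision per label from true and predicted labels. The function name and the toy labels are made up for the example, and the post does not show how the original metric was computed or logged.

```python
from collections import Counter


def per_class_precision(y_true, y_pred):
    """Precision per label: correct predictions of a label / all predictions of it."""
    predicted = Counter(y_pred)
    correct = Counter(p for t, p in zip(y_true, y_pred) if t == p)
    # Labels that were never predicted have undefined precision and are omitted.
    return {label: correct[label] / count for label, count in predicted.items()}


y_true = ["Aves", "Aves", "Plantae", "Animalia", "Plantae", "Aves"]
y_pred = ["Aves", "Aves", "Aves", "Plantae", "Plantae", "Animalia"]
for label, precision in sorted(per_class_precision(y_true, y_pred).items()):
    print(f"{label}: {precision:.2f}")
```

Comparing these values across labels is what surfaces the harder categories; a broad label such as Animalia tends to attract more wrong predictions than a narrow one.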

In subsequent posts, I will extend these initial trials to the paradigm of curriculum learning, that is, whether the order of the training data matters. I will continue to explain helpful wandb features and practices as I discover them. In the meantime, you can check out the relevant code here under keras-cnn-nature.

[1] The full iNaturalist dataset is 186GB, so a convenient 12K subset can be downloaded here, and some tools for generating more manageable subsets via symlinks are here.