Six Ways to Debug a Machine Learning Model

James Le

In traditional software development, a bug usually leads to the program crashing. While this is annoying for the user, it is useful for the developer: when the program fails, the developer can inspect the errors to understand why.

With machine learning models, the developer sometimes encounters explicit errors, but all too often the model fails without a clear reason why. While crashes can be debugged manually, machine learning models most often fail by producing poor predictions. What’s worse, when they fail, there is usually no signal about why or when they failed. To complicate matters further, the failure might have any of several causes, including bad training data, a high loss, or a slow convergence rate.

In this blog post, we’ll look at how to debug these silent failures so that they don’t impact the performance of our machine learning algorithms. Here is a quick overview of what we are going to cover:

  1. How to find flaws in the input data
  2. How to make the model learn more from less data
  3. How to prepare data for training and avoid common pitfalls
  4. How to find the optimal model hyper-parameters
  5. How to schedule learning rates in order to reduce overfitting
  6. How to monitor training progress with Weights and Biases

It is worth noting that as a data science/machine learning practitioner, you need to acknowledge that there are many reasons why machine learning projects can fail. Most have nothing to do with the skills of the engineers and data scientists (just because it’s not working does not mean you’re faulty). The takeaway is that if we can identify common pitfalls or bugs early and often, we can save both time and money. In high-stakes application domains such as finance, government, and health care, this will be mission-critical.

1 - How to find flaws in the input data

There are two aspects to consider when wanting to know whether our data is up to the task of training a good model: whether the data contains predictive information, and whether we have enough of it.

To figure out whether our data contains predictive information, we can ask ourselves: can a human make a prediction given this data?

If a human cannot understand an image or a text, chances are our model won’t make much sense of it either. If there is not enough predictive information, adding more inputs to our model won’t make it better; on the contrary, the model will overfit and become less accurate.

Once our data has enough predictive information, we need to figure out whether we have enough data to train a model to extract the signal. There are a couple of rules of thumb to follow:

(1) For classification, we should have at least 30 independent samples per class.

(2) We should have at least 10 samples for any feature, especially for structured data problems.

(3) The size of our dataset should be proportional to the number of parameters in our model.

These rules might need to be tailored to your specific application. If you can make use of transfer learning, then you can drastically reduce the number of samples needed.
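The first two rules of thumb are easy to check automatically before training. The sketch below is only an illustration: the function name `check_dataset_size`, its thresholds, and the toy labels are hypothetical, and rule (2) is interpreted here as "at least 10 samples per feature in total", which is an assumption.

```python
from collections import Counter


def check_dataset_size(labels, n_features, min_per_class=30, min_per_feature=10):
    """Return a list of warnings for the sample-size rules of thumb."""
    warnings = []
    # Rule (1): at least `min_per_class` samples for every class.
    for cls, count in Counter(labels).items():
        if count < min_per_class:
            warnings.append(f"class {cls!r} has {count} samples (< {min_per_class})")
    # Rule (2), as interpreted here: at least `min_per_feature` samples per feature.
    needed = min_per_feature * n_features
    if len(labels) < needed:
        warnings.append(f"{len(labels)} samples for {n_features} features (< {needed})")
    return warnings


# Hypothetical toy dataset: 40 "cat" labels, 12 "dog" labels, 8 features.
labels = ["cat"] * 40 + ["dog"] * 12
problems = check_dataset_size(labels, n_features=8)
for problem in problems:
    print(problem)
```

An empty list means neither rule was violated; it does not guarantee that the dataset is large enough for a given model.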

2 - How to make the model learn more from less data

In many situations, we simply do not have enough data. In this case, one of the best options is to augment the data. Taking augmentation a step further, we can generate our own data with generative models such as autoencoders and generative adversarial networks.
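As a rough illustration of augmentation, the sketch below creates randomly flipped and slightly noisy copies of a tiny grayscale image stored as nested lists. The helper `augment`, the noise level, and the toy image are hypothetical; real projects would typically rely on the augmentation utilities of their deep learning framework instead.

```python
import random


def augment(image, rng, noise=0.05):
    """Return a randomly flipped, slightly noisy copy of a 2-D grayscale image."""
    # Flip horizontally with probability 0.5; otherwise copy the rows unchanged.
    if rng.random() < 0.5:
        rows = [row[::-1] for row in image]
    else:
        rows = [row[:] for row in image]
    # Add small uniform noise to each pixel and clamp the result to [0, 1].
    return [
        [min(1.0, max(0.0, px + rng.uniform(-noise, noise))) for px in row]
        for row in rows
    ]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
image = [[0.0, 0.5, 1.0],
         [0.2, 0.4, 0.6]]
augmented = [augment(image, rng) for _ in range(4)]
```

Each augmented copy keeps the original label, so four copies here turn one labeled sample into five.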

Likewise, we can look for external public data available on the Internet. Even if the data were not originally collected for our purpose, we can potentially relabel it or use it for transfer learning. We can train a model on a large dataset for a different task and then use that model as a basis for our task. Similarly, we can find a model that someone else has trained for a different task and repurpose it for our task.

It’s important to remember that in most cases the quality of the data trumps the quantity of the data. Having a small, high-quality dataset and training a simple model is the best practice to find problems in the data early in the training process. A mistake many data scientists make is that they spend time and money on getting a big dataset, only to figure out later that they have the wrong kind of data for their project.

3 - How to prepare data for training and avoid common pitfalls

There are 3 common ways to pre-process the data features for the training process:

However we prepare the features, it is important to measure the scaling factors, the mean, and the standard deviation on the training set only. If we measure these factors over the entire dataset, the algorithm might perform better on the test set than it will in production, because information about the test set has leaked into training.
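A minimal sketch of leak-free scaling, assuming standardization as the pre-processing step: the mean and standard deviation are measured on the training split and then reused, unchanged, on the test split. The helper names and the toy numbers are hypothetical.

```python
import statistics


def fit_standardizer(train_column):
    """Measure the mean and standard deviation on the training data only."""
    mean = statistics.fmean(train_column)
    std = statistics.pstdev(train_column)
    # Guard against constant features, which would cause a division by zero.
    return mean, (std if std > 0 else 1.0)


def standardize(column, mean, std):
    """Apply previously measured statistics to any split."""
    return [(x - mean) / std for x in column]


train = [2.0, 4.0, 6.0, 8.0]
test = [5.0, 10.0]

mean, std = fit_standardizer(train)            # statistics come from train only
train_scaled = standardize(train, mean, std)
test_scaled = standardize(test, mean, std)     # test reuses the train statistics
```

The test split will generally not end up with exactly zero mean and unit variance, and that is expected: production data will be scaled the same way.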

4 - How to find the optimal model hyperparameters

Manually tuning the hyperparameters of a neural network model can be extremely tedious, because there is no scientific rule to follow when it comes to hyperparameter tuning. That is why many data scientists have moved towards automatic hyperparameter search, using some sort of non-gradient-based optimization algorithm.
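Random search is one of the simplest non-gradient-based search strategies. The sketch below is not the sweep mechanism used in the example that follows; it is a generic illustration in which `validation_loss` is a hypothetical stand-in for training a model and returning its validation loss, and the search ranges are assumptions.

```python
import math
import random


def validation_loss(learning_rate, weight_decay):
    """Hypothetical stand-in for a full training run; lower is better."""
    return (math.log10(learning_rate) + 3) ** 2 + (math.log10(weight_decay) + 4) ** 2


def random_search(n_trials, seed=0):
    """Sample hyperparameters log-uniformly and keep the best trial."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        params = {
            "learning_rate": 10 ** rng.uniform(-5, -1),
            "weight_decay": 10 ** rng.uniform(-6, -2),
        }
        loss = validation_loss(**params)
        if best is None or loss < best[0]:
            best = (loss, params)
    return best


best_loss, best_params = random_search(n_trials=50)
print(best_loss, best_params)
```

Sampling the exponent uniformly, rather than the value itself, spreads trials evenly across orders of magnitude, which suits parameters such as the learning rate.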

To see how we can find the optimal hyperparameters for our model with Weights and Biases, let’s take a look at this example from a Mask R-CNN computer vision model. With the goal of implementing Mask R-CNN for semantic segmentation tasks, Connor and Trent tweaked different hyper-parameters that govern how the model operates: learning rate, gradient clip normalization, momentum, weight decay, scale ratios, and the weights of various loss functions. They wanted to know how the semantic segmentation of the images progressed as the model trained with different hyper-parameters, so they integrated an ImageCallback() class to sync to wandb. Furthermore, they included a script to run parameter sweeps that can be adapted to work with different hyper-parameters or different values of the same ones.

Their results can be found on the wandb run page. It seems that a high gradient clipping setting and a high learning rate can lead to better model accuracy, with the validation loss decreasing quickly as the number of iterations increases.

5 - How to schedule learning rates in order to reduce overfitting

One of the most important hyperparameters is the learning rate, which is difficult to optimize. A small learning rate leads to slow training, while a large learning rate can overshoot the minimum and make training unstable or even divergent.

When it comes to finding a learning rate, standard hyperparameter search techniques are not the best choice. For the learning rate, it is better to perform a line search and visualize the loss for different learning rates, as this will give you an understanding of how the loss function behaves. When doing a line search, it is better to increase the learning rate exponentially, because you are more likely to care about the region of smaller learning rates than about very large learning rates.
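The line search can be sketched on a toy problem. Here the "model" is a single weight `w` with loss `w * w`, trained by plain gradient descent for a fixed number of steps; this setup, the function `final_loss`, and the chosen range of learning rates are all illustrative assumptions, not part of the original experiments.

```python
def final_loss(learning_rate, steps=20):
    """Run gradient descent on the toy loss w**2 and return the final loss."""
    w = 1.0
    for _ in range(steps):
        grad = 2.0 * w              # derivative of w**2
        w -= learning_rate * grad
    return w * w


# Exponentially spaced learning rates: 1e-4, 1e-3, ..., 1e1.
learning_rates = [10.0 ** exponent for exponent in range(-4, 2)]
results = [(lr, final_loss(lr)) for lr in learning_rates]

for lr, loss in results:
    print(f"lr={lr:g}  final loss={loss:.3g}")

best_lr = min(results, key=lambda pair: pair[1])[0]
```

In a real project, the printed table would be replaced by a plot of loss against learning rate on a logarithmic axis, which makes the usable range and the point of divergence easy to see.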

In the beginning, our model might be far away from the optimal solution, so we want to move as fast as possible. As we approach the minimum loss, however, we want to move more slowly to avoid overshooting. A popular method is to anneal the learning rate over time, which means starting with a relatively high learning rate and then gradually lowering it during training. The intuition is that we would like to move quickly from the initial parameters to a range of good-enough parameter values, and then we can explore the “deeper, but narrower parts of the loss function.” The most popular form of learning rate annealing is a step decay, where the learning rate is reduced by some percentage after a set number of training epochs. More generally, we should define a learning rate schedule to update the rate during training according to a specified rule.
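A step decay schedule fits in a few lines. In this sketch the function name `step_decay`, the drop factor of 0.5, and the interval of 10 epochs are illustrative assumptions rather than recommended values.

```python
def step_decay(initial_lr, epoch, drop=0.5, epochs_per_drop=10):
    """Multiply the learning rate by `drop` once every `epochs_per_drop` epochs."""
    return initial_lr * drop ** (epoch // epochs_per_drop)


# Learning rate for each of 30 epochs, starting from 0.1.
schedule = [step_decay(0.1, epoch) for epoch in range(30)]
print(schedule[0], schedule[10], schedule[20])
```

Most deep learning frameworks accept a function of this shape as a scheduler callback, so the same rule can be plugged into an actual training loop.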

6 - How to monitor training progress with Weights and Biases

An important part of debugging a model is knowing when things go wrong before you have invested significant amounts of time training the model. WandB provides a seamless way to visualize and track machine learning experiments. As described in this GitHub repo, you can search, compare, and visualize training runs, analyze system usage metrics alongside runs, replicate historic results, and much more.

All we have to do after installing wandb is to include this piece of code in our training script:

import wandb
# Your custom arguments defined here
args = ...
wandb.init(config=args, project="my-project")
wandb.config["more"] = "custom"
def training_loop():
   while True:
       # Do some machine learning
       epoch, loss, val_loss = ...
       # Framework agnostic / custom metrics
       wandb.log({"epoch": epoch, "loss": loss, "val_loss": val_loss})

Alternatively, we can integrate with TensorBoard in one line, by passing sync_tensorboard=True to wandb.init().


TensorBoard is a TensorFlow extension that allows us to easily monitor our model in a browser. In addition to providing an interface from which we can watch the model’s progress, TensorBoard also offers some options useful for debugging. For example, we can observe the distributions of the model’s weights and gradients during training. If we really want to dig into the model, TensorBoard offers a visual debugger. In this debugger, we can step through the execution of the TensorFlow model and examine every single value inside it. This is especially useful if we are working on complex models, such as variational auto-encoders, and are trying to understand why complex things break down.


We now have a substantial number of tools that will help us run actual machine learning projects. Making sure that the model works before deploying it is crucial and the failure to do so can cost us a lot of money. Hopefully, this blog post equips you well with practical techniques to make models resilient, generalizable, and easy to debug.
