
Batch Normalization in Keras - An Example

Implementing Batch Normalization in a Keras model and observing the effect of changing batch sizes, learning rates and dropout on model performance.


Introduction

In this report, we'll show you how to add batch normalization to a Keras model and observe the effect BatchNormalization has as we change the batch size, vary the learning rate, and add dropout. Batch normalization normalizes the hidden representations learned during training (i.e., the outputs of the hidden layers) in order to address internal covariate shift.

Run example in colab →

1. Add batch normalization to a Keras model

Keras provides a plug-and-play implementation of batch normalization through the tf.keras.layers.BatchNormalization layer. Official documentation here. We add BatchNormalization between the output of a layer and its activation:
# A convolutional layer producing the hidden representation (arguments elided).
x = keras.layers.Conv2D(filters, kernel_size, strides, padding, ...)(x)
# The BN layer goes between the output of a layer and its activation.
x = keras.layers.BatchNormalization()(x)
# Apply the activation.
x = keras.activations.relu(x)
A few important parameters for customizing the behavior of the BatchNormalization() layer (a short example follows this list):
  • axis: Integer, the axis that should be normalized (typically the features axis). For instance, after a Conv2D layer with data_format="channels_first", set axis=1 in BatchNormalization.
  • momentum: Momentum for the moving average.
  • epsilon: Small float added to variance to avoid dividing by zero.
  • center: If True, add offset of beta to normalized tensor. If False, beta is ignored.
  • scale: If True, multiply by gamma. If False, gamma is not used. When the next layer is linear (also e.g. nn.relu), this can be disabled since the scaling will be done by the next layer.
  • renorm: Whether to use Batch Renormalization. This adds extra variables during training. The inference is the same for either value of this parameter.
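As a quick illustration of how these arguments fit together, here is a minimal sketch of a small conv block with an explicitly configured BatchNormalization layer. The layer sizes and parameter values below are illustrative only; they are not the ones used in this report's experiments.
from tensorflow import keras

# Illustrative conv block with an explicitly configured BatchNormalization layer.
inputs = keras.Input(shape=(28, 28, 1))                  # channels_last input
x = keras.layers.Conv2D(32, 3, padding="same")(inputs)
x = keras.layers.BatchNormalization(
    axis=-1,        # features axis; use axis=1 after a channels_first Conv2D
    momentum=0.99,  # momentum for the moving mean/variance
    epsilon=1e-3,   # small constant added to the variance for numerical stability
    center=True,    # learn the beta offset
    scale=True,     # learn the gamma scale (can be False if the next op is linear/ReLU)
)(x)
x = keras.activations.relu(x)
model = keras.Model(inputs, x)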

2. Observe the effect of batch normalization

Let us train a baseline model with and without a batch normalization layer. Both models were trained for 25 epochs with the Adam optimizer and an exponential learning rate decay schedule.
We use Weights & Biases to automatically save all our model performance metrics. You can see the plots generated by W&B below.
Check out this colab for the full example code.
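For reference, here is a minimal sketch of this setup. The tiny architecture inside build_model is a stand-in (the report's actual model lives in the linked colab), and the decay-schedule values and W&B project name are placeholders; only the Adam optimizer, the exponential learning rate decay, the 25 epochs, and the W&B logging come from the experiment described above.
import wandb
from wandb.keras import WandbCallback
from tensorflow import keras

def build_model(use_batchnorm):
    # Stand-in CNN; the report's actual architecture lives in the colab.
    inputs = keras.Input(shape=(28, 28, 1))
    x = keras.layers.Conv2D(32, 3, padding="same")(inputs)
    if use_batchnorm:
        x = keras.layers.BatchNormalization()(x)
    x = keras.layers.Activation("relu")(x)
    x = keras.layers.Flatten()(x)
    outputs = keras.layers.Dense(10, activation="softmax")(x)
    return keras.Model(inputs, outputs)

def train(use_batchnorm, x_train, y_train, x_val, y_val):
    # Adam with an exponential learning rate decay schedule (decay values are illustrative).
    lr_schedule = keras.optimizers.schedules.ExponentialDecay(
        initial_learning_rate=1e-3, decay_steps=1000, decay_rate=0.9)
    model = build_model(use_batchnorm)
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=lr_schedule),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    wandb.init(project="batchnorm-keras", config={"batchnorm": use_batchnorm})
    model.fit(x_train, y_train,
              validation_data=(x_val, y_val),
              epochs=25,
              callbacks=[WandbCallback()])  # streams metrics to W&B
    wandb.finish()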



[W&B chart panel: Run set, 2 runs]



Observations

  • The learning rate decay helped stabilize the validation curve. However, the model trained with BN oscillates more than the one trained without BN, which can be attributed to its quicker convergence.
  • The test error rate with BN is lower by a margin of ~4%, which is a substantial improvement: the model converges quickly and reaches a higher validation accuracy.
  • The model trained without BN looks likely to overfit if trained for more epochs.
  • Batch normalization generally improves training speed at higher learning rates. For this particular experiment, both models were trained with the same initial learning rate of 0.001, yet the model trained with BN still converges more quickly.

3. Effect of batch size on Batch Normalization

This experiment demonstrates the effect of batch size on training with batch normalization.
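One way to run this kind of sweep, reusing the hypothetical build_model helper from the sketch in the previous section; the list of batch sizes and the project name are illustrative.
import wandb
from wandb.keras import WandbCallback

# Assumes (x_train, y_train), (x_val, y_val) are already loaded and build_model()
# is the stand-in helper sketched in the previous section.
for batch_size in [16, 32, 64, 128, 256, 512]:
    wandb.init(project="batchnorm-keras", config={"batch_size": batch_size}, reinit=True)
    model = build_model(use_batchnorm=True)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.fit(x_train, y_train,
              validation_data=(x_val, y_val),
              epochs=25,
              batch_size=batch_size,
              callbacks=[WandbCallback()])
    wandb.finish()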

Observations

  • Right off the bat, we can see that smaller batch sizes gave superior model performance; the test error rate is lowest for a batch size of 16.
  • The test error rate increases as we increase the batch size, as is evident from the bar chart above. In my opinion, this is because a bigger batch size makes the computed statistics, i.e., the mean and standard deviation of the training batch, much closer to those of the population (the training set). Thus, the effect of batch normalization diminishes with increasing batch size.
  • The validation curve shows some unusual behavior for the larger batch sizes. I do not have an explanation for it, and it is something worth investigating.



[W&B chart panel: Run set 2, 9 runs]



4. Effect of learning rate on Batch Normalization

As mentioned earlier, batch normalization allows us to train our models with a higher learning rate, which means our network can converge faster while still avoiding internal covariate shift.
This experiment showcases that effect. You may be surprised at how effective the simple idea of normalizing the inputs of the hidden layers can be.
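Here is a sketch of this comparison, again reusing the hypothetical build_model helper from the earlier sketch: the same architecture is trained with and without BN at a deliberately high learning rate of 0.01.
import wandb
from wandb.keras import WandbCallback
from tensorflow import keras

# Assumes (x_train, y_train), (x_val, y_val) are loaded and build_model() is the
# stand-in helper sketched earlier.
for use_batchnorm in (False, True):
    wandb.init(project="batchnorm-keras",
               config={"batchnorm": use_batchnorm, "learning_rate": 0.01},
               reinit=True)
    model = build_model(use_batchnorm)
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.01),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.fit(x_train, y_train, validation_data=(x_val, y_val),
              epochs=25, callbacks=[WandbCallback()])
    wandb.finish()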

Observations

  • With a learning rate of 0.01, the model trained with Batch Normalization (BN) layers performed far better than the model trained without them.
  • The non-BN model has 37,322 trainable parameters, while the BN-based model has 37,834, an increase of only 512. The drastic change in the learning process is therefore not due to those additional trainable parameters; the normalization of the hidden representations is what does the work.
  • The non-BN model's test error rate is 90%, which for a 10-class classification problem amounts to random guessing. With the same configuration and added BN layers, the test error rate dropped to 29.1%. This shows that BN works.
  • Thus, with proper learning rate scheduling, we can speed up the training process and achieve better and quicker convergence using batch normalization.



[W&B chart panel: Run set, 2 runs]



5. Batch Normalization vs Dropout

Dropout is one of the most commonly used methods to prevent models from overfitting. However, batch normalization also provides a regularization effect, partially or entirely replacing the need for dropout. We have already seen some positive effects of batch normalization, so it is interesting to compare its performance against dropout.
This experiment will answer these questions:
  • Where should the dropout layer be placed - before or after the pooling layer?
  • Can batch normalization improve model generalization by reducing overfitting more than dropout does? How does model performance compare between the two?
Note: This experiment investigates the effect of batch normalization and dropout only in the conv block. The results are dataset- and model-specific, but somewhat generalizable.

Where should Dropout be placed?

I trained a baseline model and then regularized it with dropout layers in two settings (sketched below):
  • After the MaxPooling2D layer.
  • Before the MaxPooling2D layer.
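Here is a sketch of the two placements inside a single conv block; the filter count and kernel size are illustrative, and the full model lives in the colab.
from tensorflow import keras

def conv_block(x, filters, dropout_after_pool):
    # Conv block with Dropout(0.25) placed either before or after MaxPooling2D.
    x = keras.layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    if not dropout_after_pool:
        x = keras.layers.Dropout(0.25)(x)   # setting 2: dropout before pooling
    x = keras.layers.MaxPooling2D()(x)
    if dropout_after_pool:
        x = keras.layers.Dropout(0.25)(x)   # setting 1: dropout after pooling
    return x

# Example usage inside a functional model:
inputs = keras.Input(shape=(28, 28, 1))
x = conv_block(inputs, 32, dropout_after_pool=True)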
The models were trained with early stopping, monitoring val_loss with a patience of 10 epochs, and the dropout rate was set to 0.25. The comparative results for both settings are shown below:



[W&B chart panel: Run set, 3 runs]



Observations

  • The baseline model overfitted quickly. Both dropout placements helped regularize the model.
  • With the dropout layer placed after the pooling layer, the model could not attain as high a training accuracy. TensorFlow applies dropout element-wise, i.e., some activations are randomly masked by multiplying them by zero, while the total number of neurons stays the same. Applying pooling over a dropped-out feature map is therefore much like applying pooling directly to the conv layer's output. However, when dropout is applied after the pooling operation, we randomly drop neurons from an already reduced set. In that setting the regularization is stronger, but we do not use the full capacity of the model.
  • The model with the dropout layer placed after the pooling layer generalized better, as we can see from the test error rate.
There is no right or wrong way of using the dropout layer in a convolutional block. Nevertheless, using dropout after the pooling layer significantly reduces model capacity. Batch normalization is usually preferred for conv blocks, as we will see in the next section.

How does model performance compare between dropout and batch normalization?




[W&B chart panel: Run set, 4 runs]



Observations

  • The model trained with batch normalization performs better than the baseline, even though it overfitted after a few epochs. Note that it also converged quickly.
  • Batch normalization performed better than dropout placed before the pooling layer, but slightly worse than dropout placed after the pooling layer.
  • The main reason the model overfitted, and batch normalization could not regularize it all that well, is the use of the Flatten layer. The flattening operation produces a large number of neurons that need to be regularized, either by adding a dropout layer or by replacing Flatten with a Global Average Pooling operation (see the sketch after this list).
  • As said earlier, batch normalization can provide the required regularization, but that is not guaranteed. However, for a deeper and well-constructed model, dropout is seldom used.
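Below is a minimal sketch of the two classifier-head options mentioned in the third observation; the dropout rate and the number of classes are illustrative.
from tensorflow import keras

# Option A: Flatten head. Many parameters, so it usually needs extra regularization.
head_flatten = keras.Sequential([
    keras.layers.Flatten(),
    keras.layers.Dropout(0.5),        # illustrative rate
    keras.layers.Dense(10, activation="softmax"),
])

# Option B: Global Average Pooling head. Far fewer parameters, which itself curbs overfitting.
head_gap = keras.Sequential([
    keras.layers.GlobalAveragePooling2D(),
    keras.layers.Dense(10, activation="softmax"),
])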

Conclusion

Batch Normalization is a robust technique widely used to train deep learning models. To summarize the results:
  • BN helps prevent overfitting and improves generalization.
  • The model converges quickly, and thus training time is reduced.
  • Smaller batch sizes work better with batch normalization.
  • BN allows the use of a higher learning rate, which improves training time.
  • BN should be preferred over dropout for conv blocks.
  • BN can provide the required regularization.
I hope you found this report insightful. If you have any feedback, reach out to me on Twitter @ayushthakur0.

Weights & Biases

Weights & Biases helps you keep track of your machine learning experiments. Use our tool to log hyperparameters and output metrics from your runs, then visualize and compare results and quickly share findings with your colleagues.
Get started in 5 minutes.
