Author: Sayak Paul
In this article, we’ll review and compare a plethora of weight initialization methods for neural nets. We will also outline a simple recipe for initializing the weights in a neural net.
A neural net can be viewed as a function with learnable parameters, and those parameters are often referred to as weights and biases. When we start training a neural net, these parameters (typically the weights) are initialized in a number of different ways: sometimes using constant values like 0’s and 1’s, sometimes with values sampled from some distribution (typically a uniform or normal distribution), and sometimes with more sophisticated schemes like Xavier Initialization.
The performance of a neural net depends a lot on how its parameters are initialized when it starts to train. Moreover, if we initialize it randomly on each run without any control, the results are bound to be (almost) non-reproducible and possibly not very performant either. On the other hand, if we initialize it with constant values, it might take far too long to converge. With that, we also eliminate the beauty of randomness, which in turn gives a neural net the power to reach convergence quicker using gradient-based learning. We clearly need a better way to initialize it.
Careful initialization of weights not only helps us develop more reproducible neural nets but also helps us train them better, as we will see in this article. Let’s dive in!
We are going to study the effects of the following weight initialization schemes:
Finally, we are going to see the effects of the default weight initialization scheme that comes with tf.keras.
To make the experiments quick and consistent, let’s fix the dataset and a simple model architecture. For experiments like this, my favorite dataset to start off with is the FashionMNIST dataset. We will be using the following model architecture:
The model takes a flattened feature vector of shape (784, ) and, after passing it through a set of dropout and dense layers, produces a prediction vector of shape (10, ), which corresponds to the probabilities of the 10 different classes present in the FashionMNIST dataset.
This is the model architecture we will be using for all the experiments. We will be using sparse_categorical_crossentropy as the loss function and the Adam optimizer.
Let’s first throw a weight vector of all zeros at our model and see how it performs in 10 epochs of training. In tf.keras, layers like Dense, Conv2D, and LSTM have two arguments: kernel_initializer and bias_initializer. This is where we can pass in any pre-defined initializer or even a custom one. I would recommend taking a look at this documentation, which lists all the available initializers in tf.keras.
We can set the kernel_initializer argument of all the Dense layers in our model to zeros to initialize our weight vectors to all zeros. Since the bias only adds a per-unit offset, setting it to zeros won’t matter as much as it would for the weights. In code, it would look like so (with init_scheme set to 'zeros'):
tf.keras.layers.Dense(256, activation='relu', kernel_initializer=init_scheme, bias_initializer='zeros')
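To get some intuition for why an all-zeros start is problematic, here is a minimal NumPy sketch of a single forward and backward pass. It is not the tf.keras experiment from this article: the two-layer network, the random stand-in data, and the hand-written gradients are all assumptions made for illustration, with only the 784/256/10 shapes borrowed from the setup above.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((32, 784))            # stand-in batch of flattened images
y = rng.integers(0, 10, size=32)     # stand-in class labels

# Two dense layers with all-zero kernels and biases
W1, b1 = np.zeros((784, 256)), np.zeros(256)
W2, b2 = np.zeros((256, 10)), np.zeros(10)

# Forward pass: dense -> ReLU -> dense -> softmax
h = np.maximum(0.0, x @ W1 + b1)
logits = h @ W2 + b2
probs = np.exp(logits - logits.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)

# Backward pass for the mean cross-entropy loss
d_logits = probs.copy()
d_logits[np.arange(32), y] -= 1.0
d_logits /= 32
grad_W2 = h.T @ d_logits              # h is all zeros, so this is all zeros
d_h = (d_logits @ W2.T) * (h > 0)     # W2 is all zeros, so this is all zeros
grad_W1 = x.T @ d_h
grad_b2 = d_logits.sum(axis=0)

print("max |grad_W1|:", np.abs(grad_W1).max())
print("max |grad_W2|:", np.abs(grad_W2).max())
print("max |grad_b2|:", np.abs(grad_b2).max())
```

In this sketch, both kernels receive a gradient of exactly zero, so only the output bias can move. That is one way to see why a zero-initialized network of this kind struggles to learn anything beyond class frequencies.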
From a mathematical viewpoint, a neural net is nothing but a chain of functions applied on top of each other. In these functions, we generally multiply an input vector by a weight matrix and add a bias term to the product (think of broadcasting). We then pass the result through an activation function and proceed from there.
Ideally, we would want the values of the weights to be such that they do not destroy the information in the input vector. Ultimately, we are multiplying the weights with the input vector, so we need to be very careful. So, it’s often good practice to keep the weight values small, but not so small that they cause numerical instabilities.
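The "multiply, add bias, activate" step can be sketched in a few lines of NumPy. The helper names `dense_forward` and `relu` are hypothetical and only meant to illustrate the idea; this is not how tf.keras implements its Dense layer internally.

```python
import numpy as np

def relu(z):
    """Element-wise ReLU activation."""
    return np.maximum(0.0, z)

def dense_forward(x, W, b, activation=None):
    """Multiply inputs by the weights, add the (broadcast) bias, then activate."""
    z = x @ W + b
    return z if activation is None else activation(z)

x = np.ones((2, 4))                 # batch of 2 inputs with 4 features each
W = np.full((4, 3), 0.5)            # weights mapping 4 inputs to 3 units
b = np.array([0.0, 1.0, -3.0])      # one bias per unit

out = dense_forward(x, W, b, relu)
print(out)
```

Each pre-activation here is 4 × 0.5 = 2 plus the unit's bias, so the third unit ends up negative and is clipped to zero by the ReLU. Stacking several such calls gives the chain of functions described above.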
In the earlier experiments, we saw that initializing our model with constant values is not a good idea. So, let’s try initializing the weights with distinct small numbers in the [0,1] range. We can do this by sampling values from a uniform distribution. A uniform distribution looks like so:
A uniform distribution within [-5, 5] range
Here’s the catch with uniform distributions: every value in the range has an equal chance of being sampled.
Initializing a tf.keras Dense layer with a uniform distribution is a bit more involved than the previous two schemes. We would make use of the tf.keras.initializers.RandomUniform(minval=min_val, maxval=max_val, seed=seed) class here. In this case, we would be supplying 0 as the minval and 1 as the maxval. seed could be any integer of your choice. Let’s see how it performs!
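Before looking at training results, it can help to see what [0,1] weights do to a layer's pre-activations. The following is a rough NumPy sketch that mirrors the idea of RandomUniform(minval=0, maxval=1) for a 784-to-256 layer; it does not call tf.keras, and the random input batch is only a stand-in for normalized pixel values.

```python
import numpy as np

seed = 42  # any integer; fixing it makes the sampled weights reproducible
rng = np.random.default_rng(seed)

fan_in, units = 784, 256
W = rng.uniform(0.0, 1.0, size=(fan_in, units))   # weights in [0, 1)

x = rng.random((64, fan_in))   # stand-in inputs in [0, 1)
z = x @ W                      # pre-activations of the layer

print("mean weight:", W.mean())
print("mean pre-activation:", z.mean())

# The same seed reproduces exactly the same weights
W_again = np.random.default_rng(seed).uniform(0.0, 1.0, size=(fan_in, units))
print("reproducible:", np.array_equal(W, W_again))
```

Because every weight is non-negative, each pre-activation is a sum of 784 non-negative terms, so its magnitude grows with the number of inputs rather than staying near zero.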
As we saw in the previous experiment, having some randomness when initializing the weights of a neural net can clearly help. But could we control this randomness and provide some meaningful information to our model? What if we could pass some information about the inputs we would feed to the model and have the weights somehow depend on that?
We can do this! The following rule (from Udacity’s lesson on Weight Initialization) helps us in doing so:
So, instead of sampling values from a uniform distribution over the [0,1] range, we would replace the range with [-y,y], where y = 1/sqrt(n) and n is the number of inputs to a layer. There are a number of ways to do this in tf.keras, but I found the following way to be more customizable and more readable.
import numpy as np
import tensorflow as tf

# iterate over the layers of a given model
# (assumes `model` is already built and its Dense layers use a bias)
for layer in model.layers:
    # check if the layer is of type `Dense`
    if isinstance(layer, tf.keras.layers.Dense):
        # shapes are important for matrix mult: (number of inputs, number of units)
        shape = layer.get_weights()[0].shape
        # determine the `y` value from the number of inputs
        y = 1.0 / np.sqrt(shape[0])
        # sample the values and assign them as weights
        rule_weights = np.random.uniform(-y, y, shape)
        # `set_weights` expects [kernel, bias]; the bias starts at zero
        layer.set_weights([rule_weights, np.zeros(shape[1])])
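To see what the rule buys us numerically, here is a small NumPy sketch comparing the pre-activations of a 784-to-256 layer under the [0,1] range and the [-y,y] range. The helper `rule_uniform` is a hypothetical name introduced for this illustration, and the input batch is random stand-in data rather than FashionMNIST.

```python
import numpy as np

def rule_uniform(fan_in, units, rng):
    """Sample a (fan_in, units) kernel uniformly from [-y, y], y = 1/sqrt(fan_in)."""
    y = 1.0 / np.sqrt(fan_in)
    return rng.uniform(-y, y, size=(fan_in, units)), y

rng = np.random.default_rng(0)
x = rng.random((64, 784))                          # stand-in inputs in [0, 1)

W_wide = rng.uniform(0.0, 1.0, size=(784, 256))    # plain [0, 1] range
W_rule, y = rule_uniform(784, 256, rng)            # [-y, y] range

z_wide = x @ W_wide
z_rule = x @ W_rule

print("y:", y)
print("[0, 1]  pre-activation mean/std:", z_wide.mean(), z_wide.std())
print("[-y, y] pre-activation mean/std:", z_rule.mean(), z_rule.std())
```

For 784 inputs, y works out to 1/28. Under these assumptions, the rule keeps the pre-activations centered near zero with a spread below one, while the [0,1] range pushes them far from zero.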
Let’s start with why: why use a normal distribution here? Earlier, I mentioned that smaller weight values might be better for a network to train well. To keep these initial weight values close to 0, a normal distribution would be better suited than a uniform distribution, since in a uniform distribution every number in the range has an equal probability of being sampled. For a normal distribution, that’s not the case. We would take a normal distribution with a mean of 0 and a standard deviation of y.
As can be seen in the following figure (which mimics a normal distribution), most of the values are concentrated around the mean. In our case, the mean is 0, so it might work as we expect.
A sample normal distribution
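As a quick sanity check on this intuition, the sketch below samples from a normal distribution with mean 0 and standard deviation y and compares it with the bounded uniform range. It is only an illustration in NumPy, assuming a layer with 784 inputs; the sample size is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in = 784
y = 1.0 / np.sqrt(fan_in)

w_normal = rng.normal(loc=0.0, scale=y, size=100_000)
w_uniform = rng.uniform(-y, y, size=100_000)

# Fraction of samples that fall inside [-y, y]
frac_normal_within_y = np.mean(np.abs(w_normal) <= y)     # roughly 0.68
frac_uniform_within_y = np.mean(np.abs(w_uniform) <= y)   # 1.0 by construction

print("normal  mean:", w_normal.mean(), " within [-y, y]:", frac_normal_within_y)
print("uniform mean:", w_uniform.mean(), " within [-y, y]:", frac_uniform_within_y)
```

One thing worth keeping in mind: with the standard deviation set to y, the normal samples are densest around zero, but roughly a third of them land outside [-y, y], whereas the uniform samples are strictly bounded by that range.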
The code for initializing the weights with this scheme is very similar. We are just going to swap the uniform rule for a normal one:
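import numpy as np
import tensorflow as tf

# iterate over the layers of a given model
# (assumes `model` is already built and its Dense layers use a bias)
for layer in model.layers:
    # check if the layer is of type `Dense`
    if isinstance(layer, tf.keras.layers.Dense):
        # shapes are important for matrix mult: (number of inputs, number of units)
        shape = layer.get_weights()[0].shape
        # determine the `y` value from the number of inputs
        y = 1.0 / np.sqrt(shape[0])
        # sample the values (mean 0, standard deviation y) and assign them as weights
        rule_weights = np.random.normal(0, y, shape)
        # `set_weights` expects [kernel, bias]; the bias starts at zero
        layer.set_weights([rule_weights, np.zeros(shape[1])])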
Let’s now see how different initialization methods affect the parameters of our network as it trains. Let’s take the uniform initialization (with [0,1] range) scheme first. TensorBoard (a tool by the TensorFlow team for visualizing and debugging machine learning models) allows us to visualize the learned parameters of a model in histograms and distributions. We will stick to histograms for this article.
Histograms of the learned parameters of our network visualized in TensorBoard.
Histograms group values into brackets (bins) and show how often values fall into each one. In the above figure, we can see that most of the weights across the different layers are well spread out across the range of [0,1]. Here are the histograms of our network initialized with the uniform distribution but with the recipe:
Histograms of the learned parameters of our network initialized with a uniform distribution but with the recipe.
We can clearly notice that when our network is initialized with the constrained uniform distribution, the dispersion in the weight distribution is smaller and most of the values are closer to zero, which is what we wanted.
I encourage you to try out this observation with the other methods we discussed. Weights and Biases makes it extremely easy to sync your TensorFlow event files so that you can host TensorBoard instances on your Weights and Biases run page itself. I will not go into the code for this portion in this article, but if you're interested, check out the Colab notebook.
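If you want to reproduce this kind of comparison without TensorBoard, a rough sketch with np.histogram looks like the following. Note the assumptions: it bins freshly initialized weights for a single 784-to-256 layer, not the learned parameters shown in the figures above, and the bin edges are chosen arbitrarily.

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in, units = 784, 256
y = 1.0 / np.sqrt(fan_in)

W_wide = rng.uniform(0.0, 1.0, size=(fan_in, units))   # [0, 1] range
W_rule = rng.uniform(-y, y, size=(fan_in, units))      # [-y, y] range

# Use the same bin edges for both so the counts are comparable
edges = np.linspace(-1.0, 1.0, 21)   # 20 bins of width 0.1
counts_wide, _ = np.histogram(W_wide, bins=edges)
counts_rule, _ = np.histogram(W_rule, bins=edges)

for lo, c_wide, c_rule in zip(edges[:-1], counts_wide, counts_rule):
    print(f"[{lo:+.1f}, {lo + 0.1:+.1f})  wide={c_wide:6d}  rule={c_rule:6d}")
```

In this sketch, the [0,1] weights spread over all the bins between 0 and 1, while the rule-based weights all fall into the two bins on either side of zero.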
Thanks for sticking with me throughout the article. The study of weight initialization in neural nets is indeed very interesting to me, as it plays a significant role in training them better. As a fun exercise, you might also check what the default initializers are in tf.keras for the Dense layers and compare the results to the ones shown in this article.
By now, you should have a mindset that helps you systematically investigate why your neural net might not be training well. Practically, there can be a lot of reasons for that, but weight initialization is definitely one of them. You now have a list of go-to initializers that you can experiment with.
Also, I wanted to share some very good references which you can look up if you are interested in studying the topic further. This line of study gained popularity with the seminal paper Understanding the difficulty of training deep feedforward neural networks by Xavier Glorot and Yoshua Bengio. It was then studied quite well by Kaiming He et al. in their paper Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. That same year, Dmytro Mishkin and Jiri Matas published a paper named All you need is a good init, where they proposed a very simple yet effective weight initialization scheme called Layer-wise Sequential Unit Variance (LSUV). LSUV has shown tremendous performance improvements on deeper architectures and has easily become a favorite choice among practitioners. There’s also some work on weight-agnostic neural nets by Adam Gaier and David Ha in their paper Weight Agnostic Neural Networks, but it is yet to get the amount of attention the earlier schemes have got.
Selecting the right combination of weight initialization method and activation function is also an important area of study, and I highly recommend reading this deeplearning.ai article if you are interested in learning more about it.
I am highly indebted to Jeremy Howard and his team at fast.ai because it was fast.ai’s course Deep Learning from the Foundations (taught by Jeremy himself) which triggered my interest to study the topic of weight initialization. If you haven’t checked out the course yet, take the time to do so.
I hope this article gave you a sense of how important a role weight initialization plays in training a neural net. I am excited to see if it helps in improving the performance of your custom neural nets too.