Effects of Weight Initialization on Neural Networks
This article explores why weight initialization matters, before providing a comparison of a range of weight initialization methods with Weights & Biases.
In this article, we’ll review and compare a plethora of weight initialization methods for neural nets. We will also outline a simple recipe for initializing the weights in a neural net.
Table of Contents
- A summary of the different weight initialization methods
- Why Does Weight Initialization Matter?
- Method 1: Weights initialized to all zeros
- Method 2: Weights initialized to all ones
- Method 3: Weights initialized with values sampled from a uniform distribution
- Method 4: Weights initialized with values sampled from a uniform distribution with a careful tweak
- Method 5: Weights Initialized With Values Sampled From a Normal Distribution With a Careful Tweak
- Effects On Training With Careful Initialization
- Some Tips and Last Thoughts
A summary of the different weight initialization methods
Why Does Weight Initialization Matter?
A neural net can be viewed as a function with learnable parameters, and those parameters are often referred to as weights and biases. When training starts, these parameters (typically the weights) are initialized in a number of different ways: sometimes with constant values like 0s and 1s, sometimes with values sampled from a distribution (typically a uniform or normal distribution), and sometimes with more sophisticated schemes like Xavier initialization.
The performance of a neural net depends heavily on how its parameters are initialized at the start of training. If we initialize them with unconstrained random values on every run, training becomes hard to reproduce and often underperforms as well. On the other hand, if we initialize them with constant values, the network may take far too long to converge, and we throw away the randomness that lets gradient-based learning break symmetry and converge faster. We clearly need a better way to initialize them.
Careful initialization of weights not only gives us more reproducible neural nets but also helps us train them better, as we will see in this article. Let's dive in!
The Different Weight Initialization Schemes
We are going to study the effects of the following weight initialization schemes:
- Weights initialized to all zeros
- Weights initialized to all ones
- Weights initialized with values sampled from a uniform distribution with a fixed bound
- Weights initialized with values sampled from a uniform distribution with a careful tweak
- Weights initialized with values sampled from a normal distribution with a careful tweak
Finally, we are going to see the effects of the default weight initialization scheme that comes with tf.keras.
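For reference, here is a rough sketch of how the first three schemes (and the tf.keras default) map onto built-in initializers; the dictionary keys are just illustrative names, and the "careful tweak" variants of Methods 4 and 5 are implemented by hand later in the article:
import tensorflow as tf

# illustrative mapping of the schemes above to built-in tf.keras initializers;
# GlorotUniform is the default that ships with Dense layers
initializers = {
    "zeros": tf.keras.initializers.Zeros(),
    "ones": tf.keras.initializers.Ones(),
    "uniform_fixed": tf.keras.initializers.RandomUniform(minval=0.0, maxval=1.0),
    "tf_keras_default": tf.keras.initializers.GlorotUniform(),
}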
Experiment Setup: The Data and the Model
To keep the experiments quick and consistent, let's fix the dataset and a simple model architecture. For experiments like this, my favorite dataset to start with is FashionMNIST. We will be using the following model architecture:

The model takes a flattened feature vector of shape (784,) and, after passing it through a set of dropout and dense layers, produces a prediction vector of shape (10,) which corresponds to the probabilities of the 10 classes present in the FashionMNIST dataset.
This is the architecture we will use for all the experiments, with sparse_categorical_crossentropy as the loss function and the Adam optimizer.
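The accompanying notebook has the exact definition, but a minimal sketch of such a model could look like the following; the layer widths (256 and 128) and the dropout rate are assumptions rather than the article's exact values:
import tensorflow as tf

def get_model(init_scheme='glorot_uniform'):
    # a small fully connected network for flattened 28x28 FashionMNIST images;
    # the widths and dropout rate here are illustrative assumptions
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(256, activation='relu', input_shape=(784,),
                              kernel_initializer=init_scheme, bias_initializer='zeros'),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(128, activation='relu',
                              kernel_initializer=init_scheme, bias_initializer='zeros'),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(10, activation='softmax',
                              kernel_initializer=init_scheme, bias_initializer='zeros')
    ])
    model.compile(loss='sparse_categorical_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])
    return model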
Method 1: Weights initialized to all zeros
Let's first throw a weight matrix of all zeros at our model and see how it performs over 10 epochs of training. In tf.keras, layers like Dense, Conv2D, and LSTM have two relevant arguments, kernel_initializer and bias_initializer, where we can pass any pre-defined initializer or even a custom one. I recommend taking a look at the documentation, which lists all the initializers available in tf.keras.
We can set the kernel_initializer argument of all the Dense layers in our model to zeros to initialize the weight matrices to all zeros. The biases are single values added per unit after the matrix multiplication, so initializing them to zeros doesn't hurt the way all-zero weights do. In code, it looks like this:
tf.keras.layers.Dense(256, activation='relu', kernel_initializer=init_scheme, bias_initializer='zeros')
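With the hypothetical get_model helper sketched earlier, the whole experiment boils down to something like this (x_train, y_train, x_val, and y_val are assumed to be the flattened, normalized FashionMNIST splits):
# 'zeros' is the string alias for tf.keras.initializers.Zeros
model = get_model(init_scheme='zeros')
model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=10)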
Model performance with all weights initialized to zeros
As we can clearly see in the two plots above, the validation loss and the training loss diverge from each other considerably, and the validation accuracy stays flat across all the epochs. This indicates that our model is really struggling to train, which should not be the case for a dataset like FashionMNIST given this architecture. It happens mainly because, starting from all zeros, every unit in a layer produces the same output and receives the same gradient, so the weight updates coming from backpropagation are not effective enough for the model to break out of that symmetry.
Therefore, it's safe to conclude that our model needs a much better starting point, i.e., a better weight initialization.
Method 2: Weights initialized to all ones
In tf.keras, initializing our model weights to all ones is similar to what we did previously: just change zeros to ones.
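Following the same pattern as before (the layer width of 256 is simply carried over from the earlier snippet):
# 'ones' is the string alias for tf.keras.initializers.Ones
tf.keras.layers.Dense(256, activation='relu',
                      kernel_initializer='ones', bias_initializer='zeros')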
As we can see in the plots above, the loss decrease certainly looks better, much better than with all zeros. The training and the validation accuracy also seem to be in sync.
Studies have shown that initializing the weights with values sampled from a random distribution, instead of constant values like zeros and ones, helps a neural net train better and faster. The imposed randomness is not only well suited to gradient-based optimization but also helps the network figure out which weights to update. Intuitively, with a constant weight initialization, all the layer outputs during the initial forward pass are essentially the same, which makes it very hard for the network to work out which weights should be updated.
Let’s now see what happens if we initialize our model weights with values sampled from a uniform distribution.
Method 3: Weights initialized with values sampled from a uniform distribution
From a mathematical viewpoint, a neural net is nothing but a chain of functions applied on top of each other. In each of these functions, we generally multiply an input vector by a weight matrix and add a bias term to the product (think of broadcasting). We then pass the result through an activation function and proceed from there.
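In symbols, each dense layer computes something like the following, where $W$ is the weight matrix, $x$ the incoming vector, $b$ the bias, and $f$ the activation function:

$$a = f(Wx + b)$$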
Ideally, we want the weight values to be such that they do not wash out the information in the input vector. Since we are ultimately multiplying the weights with the input, we need to be careful here. So, it's often good practice to keep the weight values as small as possible, but not so small that they cause numerical instabilities.
In the earlier experiments, we saw that initializing our model with constant values is not a good idea. So, let's try initializing the weights with distinct small values in the [0, 1] range. We can do this by sampling values from a uniform distribution, which looks like this:

A uniform distribution within [-5, 5] range
Here's the catch with uniform distributions: every value in the range has an equal chance of being sampled.
Initializing a tf.keras Dense layer with a uniform distribution is a bit more involved than the previous two schemes. We will use the tf.keras.initializers.RandomUniform(minval=min_val, maxval=max_val, seed=seed) class, supplying 0 as minval and 1 as maxval; seed can be any integer of your choice.
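As a minimal sketch (the seed value and the layer width of 256 are just placeholders carried over from earlier), this is how the initializer plugs into a Dense layer:
import tensorflow as tf

# the seed is a placeholder; any integer of your choice works
uniform_init = tf.keras.initializers.RandomUniform(minval=0.0, maxval=1.0, seed=7)

tf.keras.layers.Dense(256, activation='relu',
                      kernel_initializer=uniform_init, bias_initializer='zeros')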
Let’s see how it performs!
The Results So Far - Ones vs Uniform Weights Accuracy Trade-Off
Although the losses are pretty similar to the previous experiment (where the weights were initialized with ones), the accuracy has improved quite a lot.
The plot below makes it even easier to see that:
Quick Tangent: The recipe for initializing weights
As we saw in the previous experiment, having some randomness when initializing the weights of a neural net can clearly help. But could we control this randomness and provide some meaningful information to our model? What if we could pass in some information about the inputs we feed to the model and have the weights depend on that?
We can do this! The following rule (from Udacity's lesson on Weight Initialization) helps us do so:

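Concretely, the rule (as it is applied in the code below) sets the sampling bound for each layer to

$$y = \frac{1}{\sqrt{n}}$$

where $n$ is the number of inputs to that layer, and the initial weights are then drawn from the uniform distribution over $[-y, y]$.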
Method 4: Weights initialized with values sampled from a uniform distribution with a careful tweak
So, instead of sampling values from a uniform distribution over the [0, 1] range, we replace the range with [-y, y]. There are a number of ways to do this in tf.keras, but I found the following to be the most customizable and readable.
import numpy as np
import tensorflow as tf

# iterate over the layers of a given model
for layer in model.layers:
    # check if the layer is of type `Dense`
    if isinstance(layer, tf.keras.layers.Dense):
        # shapes are important for the matrix multiplication
        shape = (layer.weights[0].shape[0], layer.weights[0].shape[1])
        # determine the `y` value from the number of inputs to the layer
        y = 1.0 / np.sqrt(shape[0])
        # sample the values and assign them as the kernel, with zero biases
        rule_weights = np.random.uniform(-y, y, shape)
        layer.set_weights([rule_weights, np.zeros(shape[1])])
Let's see how this performs:
We can clearly see that our model shows much better training behavior. Not only is it starting to generalize well, but it also reaches much better accuracy.
This brings us to our final experiment, where we sample values from a normal distribution with its standard deviation set to y.
Method 5: Weights Initialized With Values Sampled From a Normal Distribution With a Careful Tweak
Let's start with why: why use a normal distribution here? Earlier, I mentioned that smaller weight values might help a network train well. To keep the initial weight values close to 0, a normal distribution is better suited than a uniform one: in a uniform distribution every value in the range is equally likely to be sampled, whereas a normal distribution concentrates its samples around the mean. We will use a normal distribution with a mean of 0 and a standard deviation of y.
As can be seen in the following figure (which shows a sample normal distribution), most of the values are concentrated around the mean. In our case, the mean is 0, so this should work the way we expect.

A sample normal distribution
The code for initializing the weights with this scheme is pretty much the same; we just swap the uniform rule for a normal one:
# iterate over the layers of a given model
for layer in model.layers:
    # check if the layer is of type `Dense`
    if isinstance(layer, tf.keras.layers.Dense):
        # shapes are important for the matrix multiplication
        shape = (layer.weights[0].shape[0], layer.weights[0].shape[1])
        # determine the `y` value from the number of inputs to the layer
        y = 1.0 / np.sqrt(shape[0])
        # sample the values from a zero-mean normal distribution with std `y`
        rule_weights = np.random.normal(0, y, shape)
        layer.set_weights([rule_weights, np.zeros(shape[1])])
Here's How It Performs
As we can see, the comparison here is a close call, but we also need to keep in mind that our network is not deep enough to really tell whether sampling from a normal distribution is beneficial. I'll leave that to you as a fun weekend project.
In the next section, we will take some of the methods we discussed and compare how the weights get affected by them as our network gets trained.
Effects On Training With Careful Initialization
Let’s now see how different initialization methods affect the parameters of our network as it trains. Let’s take the uniform initialization (with [0,1] range) scheme first. TensorBoard (a tool by the TensorFlow team for visualizing and debugging machine learning models) allows us to visualize the learned parameters of a model in histograms and distributions. We will stick to histograms for this article.
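As a rough sketch of how these histograms can be produced (the log directory and the training arrays are assumptions), enabling histogram logging in tf.keras looks roughly like this:
# log weight histograms at the end of every epoch so that they show up
# under TensorBoard's Histograms tab; 'logs' is just a placeholder directory
tensorboard_cb = tf.keras.callbacks.TensorBoard(log_dir='logs', histogram_freq=1)

model.fit(x_train, y_train,
          validation_data=(x_val, y_val),
          epochs=10,
          callbacks=[tensorboard_cb])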

Histograms show how often values fall into different brackets (bins). In the figure above, we can see that most of the weights across the different layers are spread out across the [0, 1] range. Here are the histograms of our network initialized with the uniform distribution but with the recipe:
Histograms of the learned parameters of our network initialized with a uniform distribution but with the recipe.

We can clearly see that when our network is initialized with the constrained uniform distribution, the weights are much less dispersed and most of the values stay close to zero, which is exactly what we wanted.
I encourage you to repeat this observation with the other methods we discussed. Weights & Biases makes it extremely easy to sync your TensorFlow event files so that you can host TensorBoard instances right in your Weights & Biases run page. I won't go into the full code for this portion of the article, but if you're interested, check out the Colab notebook →.
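The notebook has the full setup, but as a minimal hint (the project name here is just a placeholder), the sync typically boils down to:
import wandb

# sync_tensorboard=True uploads the TensorFlow event files so that
# TensorBoard renders inside the Weights & Biases run page
wandb.init(project='weight-initialization', sync_tensorboard=True)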
Some Tips and Last Thoughts
Thanks for sticking with me throughout the article. The study of weight initialization in neural nets is very interesting to me, as it plays a significant role in training them well. As a fun exercise, you might also check what the default initializer for Dense layers in tf.keras is and compare its results to the ones shown in this article.
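As a small starting point for that exercise (just a quick sanity check rather than the full comparison):
# inspect the default kernel initializer of a fresh Dense layer
layer = tf.keras.layers.Dense(10)
print(type(layer.kernel_initializer).__name__)  # GlorotUniform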
By now, you should have a mindset that helps you systematically investigate why your neural net might not be training well. In practice, there can be many reasons for that, but weight initialization is definitely one of them. You now have a list of go-to initializers to experiment with.
Also, I wanted to share some very good references that you can look up if you are interested in studying the topic further. The field gained popularity with the seminal paper Understanding the difficulty of training deep feedforward neural networks by Glorot and Bengio. It was then studied further by He et al. in their paper Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification.
That same year, Mishkin and Matas published a paper named All you need is a good init, where they proposed a very simple yet effective weight initialization scheme called Layer-wise Sequential Unit Variance (LSUV). LSUV has shown strong performance improvements on deeper architectures and has become a popular choice among practitioners. There is also some work on weight-agnostic neural nets by Gaier and Ha in their paper Weight Agnostic Neural Networks, but it is yet to receive the amount of attention the earlier schemes have.
Selecting the right combination of weight initialization method and activation function is an important topic in its own right, and I highly recommend reading this deeplearning.ai article if you are interested in learning more about it.
I am highly indebted to Jeremy Howard and his team at fast.ai, because it was fast.ai's course Deep Learning from the Foundations (taught by Jeremy himself) that sparked my interest in studying weight initialization. If you haven't checked out the course yet, take the time to do so.
I hope this article gave you a sense of how important a role a weight initialization method plays in training a neural net. I am excited to see if it helps in improving the performance of your custom neural nets too.
Try various weight initialization methods in Colab for yourself →.