A Gentle Introduction To Weight Initialization for Neural Networks

An explainer and comprehensive overview of various strategies for neural network weight initialization. Made by Saurav Maheshkar using Weights & Biases



What is Weight Initialization in Deep Learning?

Figure 1: Consider this curve: our network's performance and training time will depend on whether the model initially starts at A, B, or C.
Weight Initialization was first discussed as a "trick" (LeCun et al., 1998) to prevent certain undesirable behaviours during neural network training. The initial values of the weights can have a significant impact on the training process.
As shown in Fig. 1 above, for instance, depending on where the deep learning model starts in the training process, it can converge to any of the possible local minima in the irregular loss surface. You might think that initializing all weights to zero or to some random values should be a good starting point but, as we'll find out, the more we think about this, the more interesting the problem becomes. Mishkin and Matas (2015) showed that an arbitrary initialization can slow down or even completely stall training.
In the end, a proper initialization of the weights in a neural network is critical to its convergence.

Initializing Neural Networks depends on 4 main factors

1️. Number of Inputs D_{in}

The number of inputs varies with the type of layer used. For a dense layer it's pretty straightforward: it's simply the input dimension. For a convolutional layer it's along the lines of n_h \times n_w \times n_c, i.e. the kernel height times the kernel width times the number of input channels (see the short sketch after the fourth factor below).

2️. Number of Outputs D_{out}

3️. Type of Non-Linearity

As we'll soon find out, the type of non-linearity used also affects the choice of initialization strategy. The mean of the activation function's output is of prime importance.

4️. Type of Network
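As a rough illustration, here is a minimal sketch of how fan-in and fan-out are commonly counted for dense and convolutional layers (the layer sizes and kernel shape below are arbitrary placeholders):

```python
# Dense layer: fan_in is simply the input dimension, fan_out the output dimension.
dense_fan_in, dense_fan_out = 784, 256

# Conv layer with a (k_h x k_w) kernel, c_in input channels and c_out filters:
# fan_in counts every weight feeding into one output unit, i.e. n_h * n_w * n_c.
k_h, k_w, c_in, c_out = 3, 3, 64, 128
conv_fan_in = k_h * k_w * c_in       # 3 * 3 * 64  = 576
conv_fan_out = k_h * k_w * c_out     # 3 * 3 * 128 = 1152
```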

Some Popular Initialization Solutions

1. Uniform Initialization: sample each parameter independently from U(-a, a)
2. Normal Initialization: sample each parameter independently from N(0, \sigma^{2})
3. Orthogonal Initialization: initialize the weight matrix to be an orthogonal matrix; widely used for Convolutional Neural Networks (Saxe et al.)
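In NumPy, these three schemes might look roughly like the sketch below (the layer sizes, a and \sigma are arbitrary placeholders; the orthogonal case uses a QR decomposition, as is commonly done):

```python
import numpy as np

rng = np.random.default_rng(42)
fan_in, fan_out = 256, 128                   # hypothetical layer sizes

# 1. Uniform initialization: each weight ~ U(-a, a)
a = 0.05
W_uniform = rng.uniform(-a, a, size=(fan_in, fan_out))

# 2. Normal initialization: each weight ~ N(0, sigma^2)
sigma = 0.05
W_normal = rng.normal(0.0, sigma, size=(fan_in, fan_out))

# 3. Orthogonal initialization: columns of W are orthonormal (Saxe et al.)
Q, R = np.linalg.qr(rng.normal(size=(fan_in, fan_out)))
W_orthogonal = Q * np.sign(np.diag(R))       # sign fix for a unique decomposition

print(np.allclose(W_orthogonal.T @ W_orthogonal, np.eye(fan_out)))  # True
```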

0️⃣ Zero Initialization

As the name suggests, zero initialization involves initializing all neural network weights to 0. This is usually motivated by the assumption that, at the end of training, roughly half of the weights will be positive and half will be negative.
As shown in the diagram and gradient histogram above, the activations remain the same during forward propagation and we end up with a kind of symmetry. This symmetry is hard to break, and the network isn't really learning anything (the Symmetry Problem). If all the weights of the network are initialized to zero, all the activations are zero, and correspondingly so are the gradients. In fact, it doesn't matter if the weights are initialized to any other constant: the activations might not be zero in that case, but every neuron in a layer still computes the same value, receives the same gradient, and therefore gets the same update. Thus, we can rule out zero/constant initialization.
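Here is a minimal sketch of the symmetry problem on a toy 2-layer tanh network with an MSE loss (the shapes and the constant 0.5 are arbitrary, illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))                 # batch of 8 examples, 4 features
y = rng.normal(size=(8, 1))                 # regression targets

def gradients(c):
    """One forward/backward pass of a 2-layer tanh MLP with every weight set to c."""
    W1, W2 = np.full((4, 16), c), np.full((16, 1), c)
    h = np.tanh(x @ W1)                     # hidden activations
    d_out = 2 * (h @ W2 - y) / len(x)       # dLoss/d(output) for MSE
    dW2 = h.T @ d_out
    dW1 = x.T @ (d_out @ W2.T * (1 - h ** 2))
    return dW1, dW2

dW1, _ = gradients(0.0)
print(np.abs(dW1).max())                    # 0.0 -> zero weights give zero gradients

dW1, _ = gradients(0.5)                     # any other constant:
print(np.allclose(dW1, dW1[:, :1]))         # True -> every hidden unit gets the same update
```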

Initializing Network Weights to Small Random Numbers

NOTATION: We initialize a single rank-n tensor of weights W \in \mathbb{R}^{d_1 \times ... \times d_n}, where d_i represents the dimension along axis i.
The main idea here is to initialize the weights close to zero by drawing independent samples from a N(0, \sigma^2) Gaussian distribution with a small \sigma.
A significant problem with this approach is deciding just how small \sigma should be: as \sigma decreases, the variance of each layer's output also decreases, and in the limit of \sigma \to 0 we are back at the zero-initialization (symmetry) problem.
Additionally, if the values are too small, the activations are small and so are the gradients obtained during backpropagation. For instance, have a look at these histograms of the activations of a simple 3-layer neural network with tanh and ReLU activation functions. The weights were initialized using the following code snippet:
W = np.random.randn(fan_in, fan_out) * 0.01
NOTE: The X-axis scale is decreasing as we go deeper into the network
As we can see, even in such a simple architecture the activations shrink drastically from layer to layer; correspondingly, we'll also observe tiny gradients during backpropagation.
Now, this is a problem: we need to maintain the same variance in the activation distributions as we go deeper into the network. One might think we can simply increase the scale of the weights to obtain satisfactory activations. For instance, let's try:
W = np.random.randn(fan_in, fan_out) * 0.1
Well, now we run into the opposite problem: the tanh activations saturate around +1 and -1, where the gradient is nearly zero. Clearly, naive random initialization doesn't seem to be working. 🥲
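The effect is easy to reproduce. The sketch below (with an arbitrary depth and width) pushes random data through a stack of tanh layers and prints the standard deviation of the activations at each layer for both weight scales used above:

```python
import numpy as np

rng = np.random.default_rng(0)

def activation_stds(scale, depth=6, width=512, n_samples=1000):
    """Std of each layer's activations in a deep tanh MLP with N(0, scale^2) weights."""
    h = rng.normal(size=(n_samples, width))
    stds = []
    for _ in range(depth):
        W = rng.normal(size=(width, width)) * scale
        h = np.tanh(h @ W)
        stds.append(round(float(h.std()), 4))
    return stds

print("scale 0.01:", activation_stds(0.01))  # stds collapse towards 0 layer by layer
print("scale 0.10:", activation_stds(0.10))  # activations saturate near +1/-1 instead
```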

LeCun Initialization

Now that we have identified our problem, we need our neurons to have a significant output variance. In other words, we need to normalize the variance to obtain a nice, even distribution of values and gradients. Let's look at a method that can help us achieve this standardization.
ASSUMPTION: x_i, w_i are i.i.d. with zero mean \implies \mathbb{E}(w_i) = \mathbb{E}(x_i) = 0
We assume a simple architecture: an n_{in}-dimensional input vector x fed into a dense layer, so that the pre-activation of a single neuron is the weighted sum Z^{[1]}_{1} = \sum_{j} w_j x_j. Our aim is to keep the variance of this weighted sum well behaved.

Var(Z^{[1]}_{1}) = \sum_{j} \left[ \mathbb{E}(w_j)^2 Var(x_j) + \mathbb{E}(x_j)^2 Var(w_j) + Var(x_j) Var(w_j) \right]

Since both the weights and the inputs have zero mean, the first two terms vanish:

Var(Z^{[1]}_{1}) = \sum_{j} Var(x_j) Var(w_j)

And because the x_j and w_j are identically distributed:

Var(Z^{[1]}_{1}) = n_{in} \, Var(w) \, Var(x)

Aim: activations should have the same variance as the features \implies Var(Z^{[1]}_{1}) = Var(x)

\implies n_{in} Var(w) = 1

We know that Var(aX) = a^2 Var(X), so we need to scale standard-normal weights by \sqrt{\frac{1}{n_{in}}}.
Thus, in LeCun Initialization we effectively draw the weights from N(0, \frac{1}{n_{in}}).
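A quick numerical sanity check (with arbitrary layer sizes) confirms that this scaling keeps the pre-activation variance equal to the input variance:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 512, 512                      # hypothetical layer sizes

x = rng.normal(size=(10_000, n_in))         # zero-mean, unit-variance features

# LeCun initialization: W ~ N(0, 1 / n_in)
W = rng.normal(size=(n_in, n_out)) / np.sqrt(n_in)

z = x @ W                                   # pre-activations Z^{[1]}
print(round(x.var(), 3), round(z.var(), 3)) # both ~1.0: the variance is preserved
```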

Xavier Initialization

Further work by Xavier Glorot and Yoshua Bengio in their paper "Understanding the Difficulty of Training Deep Feedforward Neural Networks" [4] led to another popular initialization strategy in which n_{out} is taken into account as well. Our variance argument above only keeps the distributions even during the forward pass; to maintain the same property for the gradients during backpropagation, we must also consider n_{out}. Glorot and Bengio compromise between the two constraints by averaging them:
W \sim N(0, \frac{2}{n_{in} + n_{out}})
FUN FACT 🍬: Glorot Uniform is the default initialization strategy for tf.keras.layers.Dense.
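In NumPy, the normal and uniform variants might look like the sketch below (the layer sizes are placeholders; the uniform limit \sqrt{6 / (n_{in} + n_{out})} is commonly chosen so that both variants have the same variance):

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in, fan_out = 784, 256                  # hypothetical layer sizes

# Glorot / Xavier normal: W ~ N(0, 2 / (fan_in + fan_out))
W_glorot_normal = rng.normal(size=(fan_in, fan_out)) * np.sqrt(2.0 / (fan_in + fan_out))

# Glorot / Xavier uniform: W ~ U(-limit, limit) with limit = sqrt(6 / (fan_in + fan_out))
limit = np.sqrt(6.0 / (fan_in + fan_out))
W_glorot_uniform = rng.uniform(-limit, limit, size=(fan_in, fan_out))

print(W_glorot_normal.var(), W_glorot_uniform.var())  # both ~ 2 / (fan_in + fan_out)
```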

He Initialization

When ReLU (the Rectified Linear Unit) started gaining popularity, another issue was noticed: with initialization strategies such as Glorot, ReLU networks showed the same kind of collapsing activation distributions that tanh networks showed with naive random initialization. Because ReLU is typically defined as f(x) = max(0, x), its output does not have zero mean. Therefore our initial zero-mean assumption no longer holds, and we cannot use the same initialization strategies. To account for this shift, Kaiming He et al., in their seminal paper "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification" [5], introduced a new scheme, often termed "He Initialization", whose extra factor of 2 compensates for ReLU zeroing out roughly half of its inputs:
W \sim N(0, \frac{2}{n_{in}})
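The sketch below (arbitrary width and depth) shows the practical effect: with He initialization, the mean squared activation stays roughly constant even through many ReLU layers, instead of vanishing or exploding:

```python
import numpy as np

rng = np.random.default_rng(0)
width, depth = 512, 10                      # hypothetical sizes

h = rng.normal(size=(10_000, width))        # unit-variance input features
for layer in range(depth):
    # He initialization: W ~ N(0, 2 / n_in); the factor of 2 compensates for
    # ReLU discarding (on average) half of its inputs
    W = rng.normal(size=(width, width)) * np.sqrt(2.0 / width)
    h = np.maximum(0.0, h @ W)              # ReLU
    print(layer, round(float((h ** 2).mean()), 3))  # stays close to 1 at every layer
```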

Finishing Up

Having gone through some of the popular methods of initialization, let's summarize the key takeaways:

1. Zero (or any constant) initialization makes every neuron in a layer compute the same output and receive the same update, so the network never breaks symmetry.
2. Small random Gaussian weights break symmetry, but if the scale is wrong the activations either shrink towards zero or saturate, and the gradients vanish with them.
3. LeCun Initialization, N(0, \frac{1}{n_{in}}), keeps the variance of the activations stable during the forward pass.
4. Xavier Initialization, N(0, \frac{2}{n_{in} + n_{out}}), also accounts for the backward pass and suits zero-mean activations such as tanh.
5. He Initialization, N(0, \frac{2}{n_{in}}), compensates for ReLU discarding half of its inputs and is the usual choice for ReLU networks.

References 📚

  1. Efficient BackProp: Yann LeCun, Leon Bottou, Genevieve B. Orr and Klaus-Robert Müller
  2. All you need is a good init: Dmytro Mishkin and Jiri Matas
  3. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks: Andrew M. Saxe, James L. McClelland and Surya Ganguli
  4. Understanding the difficulty of training deep feedforward neural networks: Xavier Glorot and Yoshua Bengio
  5. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification; Kaiming He, Xiangyu Zhang, Shaoqing Ren and Jian Sun