
Weight initialization

Weight initialization is the procedure of setting the weights of a neural network to initial values that define the starting point for optimization.

A common heuristic is to draw small random values from a fixed range such as [-0.3, 0.3], [0, 1], or [-1, 1], as in the sketch below. The choice of weight initialization method also depends on the activation function used in the network. Let us look at a few techniques to initialize the weights of a network.
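As a quick illustration, here is a minimal sketch of that heuristic, assuming NumPy; the layer sizes are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 4, 3                                # illustrative layer sizes
W = rng.uniform(-0.3, 0.3, size=(n_in, n_out))    # every weight drawn from [-0.3, 0.3]
print(W.round(2))
```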

Glorot or Xavier initialization

The method aims to keep the variance of activations and gradients approximately the same for each layer to ensure that the network can learn effectively.

This method sets a layer’s weights to values drawn from a uniform probability distribution over

    \[[-\frac{1}{\sqrt{n}}, \frac{1}{\sqrt{n}}]\]

where n is the number of incoming connections (the fan-in) of the layer.
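A minimal sketch of this rule, assuming NumPy; the function name xavier_uniform and the layer sizes are illustrative, not taken from any particular library.

```python
import numpy as np

def xavier_uniform(n_in, n_out, rng=None):
    """Weights drawn uniformly from [-1/sqrt(n), 1/sqrt(n)], with n the fan-in."""
    rng = np.random.default_rng() if rng is None else rng
    limit = 1.0 / np.sqrt(n_in)
    return rng.uniform(-limit, limit, size=(n_in, n_out))

W = xavier_uniform(512, 256)
print(W.min(), W.max())   # both within +/- 1/sqrt(512), roughly 0.044
```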

Normalized Xavier initialization

This is a slight modification of Xavier initialization. We draw each weight from a uniform probability distribution between

    \[[-\frac{\sqrt{6}}{\sqrt{n_{i} + n_{i+1}}}, \frac{\sqrt{6}}{\sqrt{n_{i} + n_{i+1}}}]\]

where nᵢ is the number of incoming connections and nᵢ₊₁ is the number of outgoing connections of the layer. This initialization method should only be used with the sigmoid or tanh activation functions, not with ReLU.
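A sketch of the normalized variant under the same assumptions (NumPy; the name xavier_normalized is ours, not a library function).

```python
import numpy as np

def xavier_normalized(n_in, n_out, rng=None):
    """Weights drawn uniformly from +/- sqrt(6) / sqrt(n_in + n_out)."""
    rng = np.random.default_rng() if rng is None else rng
    limit = np.sqrt(6.0) / np.sqrt(n_in + n_out)
    return rng.uniform(-limit, limit, size=(n_in, n_out))

W = xavier_normalized(512, 256)
```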

Kaiming / He initialization

Unlike Glorot initialization, He initialization [1] is meant for the ReLU activation function rather than sigmoid or tanh. Here we draw each weight from a Gaussian distribution with a mean of 0 and a standard deviation of √(2/n), where n is the number of incoming connections.
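A sketch of this rule under the same assumptions (NumPy; the name he_normal is illustrative).

```python
import numpy as np

def he_normal(n_in, n_out, rng=None):
    """Gaussian weights with mean 0 and standard deviation sqrt(2 / n_in)."""
    rng = np.random.default_rng() if rng is None else rng
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))

W = he_normal(512, 256)
print(W.std())   # roughly sqrt(2/512) = 0.0625
```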

Let’s use a simplified argument to see why this standard deviation is needed.

  • Say we have 512 input nodes and 256 nodes in the next layer.
  • Assume each input and each weight has a mean of 0 and a std of 1. Each element of the next layer is the sum of 512 input-weight products, so it has a mean of 0 and a variance of 512, i.e. a std of √512.
  • We want the std of the output to be 1, so we scale the weights by 1/√512. Each weight now has a std of 1/√512 (a variance of 1/512), and each output y has a variance of 1.
  • ReLU then zeroes out the negative half of these outputs, roughly halving the signal, so He initialization compensates by doubling the weight variance to 2/n, which gives the standard deviation of √(2/n). The simulation sketch after this list reproduces these numbers.
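The following simulation sketch follows the same steps, assuming NumPy; the batch size of 10,000 is arbitrary and the printed values are approximate.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out, batch = 512, 256, 10_000

x = rng.normal(0.0, 1.0, size=(batch, n_in))       # inputs: mean 0, std 1
W = rng.normal(0.0, 1.0, size=(n_in, n_out))       # weights: mean 0, std 1

print((x @ W).var())                               # ~512: the variance blows up

W_scaled = W / np.sqrt(n_in)                       # each weight now has std 1/sqrt(512)
y = x @ W_scaled
print(y.var())                                     # ~1
print((np.maximum(y, 0.0) ** 2).mean())            # ~0.5: ReLU halves the signal power

W_he = rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))
print((np.maximum(x @ W_he, 0.0) ** 2).mean())     # ~1 again with std sqrt(2/n)
```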

References

  1. He, K., Zhang, X., Ren, S., & Sun, J. (2015). Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification.
