Improving Deep Neural Networks

1.10 Vanishing/Exploding Gradients

Improving Deep Neural Networks
One of the problems of training neural network, especially very deep neural networks, is data vanishing and exploding gradients. What that means is that when you're training a very deep network your derivatives or your slopes can sometimes get either very, very big or very, very small, maybe even exponentially small, and this makes training difficult.

Improving Deep Neural Networks
假设**函数 $g (z) = z$ ，即**函数是线性的，并且没有偏移项 $b$ ：

\begin{aligned} z^{[1]} = W^{[1]} x a^{[1]} = z^{[1]} \\ z^{[2]} = W^{[2]} a^{[1]} = W^{[2]} W^{[1]} x \\ ⋮ \\ \hat{y} = W^{[L]} W^{[L - 1]} W^{[L - 2]} \dots W^{[2]} W^{[1]} x \end{aligned}

The weights W, if they’re all just a little bit bigger than one or just a little bit bigger than the identity matrix, then with a very deep network the activations can explode.
And if W is just a little bit less than identity. So this maybe here’s 0.9, 0.9, then you have a very deep network, the activations will decrease exponentially.
And even though I went through this argument in terms of activations increasing or decreasing exponentially as a function of L, a similar argument can be used to show that the derivatives or the gradients the computer is going to send will also increase exponentially or decrease exponentially as a function of the number of layers.

1.11 Weight initialization for deep networks

Improving Deep Neural Networks

It turns out that a partial solution to vanishing/exploding gradients, doesn’t solve it entirely but helps a lot, is better or more careful choice of the random initialization for your neural network.

So in order to make $z$ not blow up and not become too small you notice that the larger n is, the smaller you want Wi to be, right? Because z is the sum of the WiXi and so if you’re adding up a lot of these terms you want each of these terms to be smaller. One reasonable thing to do would be to set the variance of Wi to be equal to 1 over n, where n is the number of input features that’s going into a neuron.

\begin{aligned} R e l u : \\ v a r (W) = \frac{2}{n^{[l - 1]}} \\ W^{[l]} = n p . r a n d o m . r a n d n (s h a p e) * n p . s q r t (\frac{2}{n^{[l - 1]}}) \\ n^{[l - 1]} denotes the number of units in l - 1 layer \\ t a n h : \\ v a r (W) = \frac{1}{n^{[l - 1]}} Xavier Initialization \end{aligned}

So in practice, what you can do is set the weight matrix $W$ for a certain layer to be $n p . r a n d o m . r a n d n$ you know, and then whatever the shape of the matrix is for this out here, and then times square root of 1 over the number of features that I fed into each neuron and there else is going to be $n^{[(l - 1)]}$ $because that’s the number of units that I’m feeding into each of the units and they are l.

So if the input features of activations are roughly mean 0 and standard variance and variance 1 then this would cause z to also take on a similar scale and this doesn’t solve, but it definitely helps reduce the vanishing, exploding gradients problem because it’s trying to set each of the weight matrices w you know so that it’s not too much bigger than 1 and not too much less than 1 so it doesn’t explode or vanish too quickly. I’ve just mention some other variants.