Abstract: This article is the transcript of Lesson 77, "Random Initialization," from Chapter 10, "Backpropagation for Neural Network Parameters," of Andrew Ng's Machine Learning course. I transcribed it while studying the videos and lightly edited it for concision and readability, so that it is easy to review later; I am sharing it here in case it helps others. If there are any errors, corrections are welcome and sincerely appreciated.
————————————————

In the previous videos, we put together almost all the pieces you need in order to implement and train your neural network. There's just one last idea I need to share with you, which is the idea of random initialization.

Neural Networks: Learning: Random initialization

When you're running an algorithm like gradient descent, or one of the advanced optimization algorithms, we need to pick some initial value for the parameters Θ. The advanced optimization algorithms assume that you will pass in some initial value initialTheta for the parameters. Now, let's consider gradient descent. For that, we also need to initialize Θ to something, and then we can slowly take steps downhill, using gradient descent to minimize the cost function J(Θ). So what do we set the initial value of Θ to? Is it possible to set it to the vector of all zeros? Whereas this worked okay when we were using logistic regression, initializing all of your parameters to zero actually does not work when you're training a neural network.
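As a small illustration of why any of these optimizers needs to be handed an initial parameter value, here is a minimal sketch (plain Python, not the course's Octave code) of gradient descent on a toy one-parameter cost J(θ) = (θ − 3)²; the cost, learning rate, and iteration count are made up for illustration:

```python
# Toy cost J(theta) = (theta - 3)^2, whose gradient is 2*(theta - 3).
# Like fminunc or gradient descent in the lecture, the loop below
# cannot start until we pick some initial value for theta.
def grad(theta):
    return 2.0 * (theta - 3.0)

theta = 0.0      # the initial value we have to choose
alpha = 0.1      # learning rate
for _ in range(100):
    theta -= alpha * grad(theta)   # take a small step downhill

print(round(theta, 4))   # converges toward the minimum at theta = 3
```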


Consider training the following neural network, and let's say we initialize all of the parameters in the network to zero. What that means is that, at initialization, this weight that I'm coloring blue is going to be equal to that blue weight, so Θ₀₁⁽¹⁾ = Θ₀₂⁽¹⁾: they're both zero. This weight that I'm coloring red is equal to that red weight, Θ₁₁⁽¹⁾ = Θ₁₂⁽¹⁾. And this weight that I'm coloring green is equal to that green weight, Θ₂₁⁽¹⁾ = Θ₂₂⁽¹⁾. And what that means is that both of your hidden units, a₁⁽²⁾ and a₂⁽²⁾, are going to be computing the same function of your inputs, so for every one of your training examples you end up with a₁⁽²⁾ = a₂⁽²⁾. Moreover, because these outgoing weights are the same (I'm not going to show this in too much detail), you can also show that the delta values are going to be the same; concretely, δ₁⁽²⁾ = δ₂⁽²⁾. And if you work through the math further, what you can show is that the partial derivatives of the cost function with respect to these two weights in your neural network are equal to each other: ∂J(Θ)/∂Θ₀₁⁽¹⁾ = ∂J(Θ)/∂Θ₀₂⁽¹⁾. And so what this means is that, even after one gradient descent update, you're going to update the first blue weight by the learning rate times this quantity, and the second blue weight by the learning rate times that same quantity.
But what this means is that even after one gradient descent update, those two blue parameters will end up the same as each other. They'll be some non-zero value now, but this value will be equal to that value. And similarly, even after one gradient descent update, the two red values will be equal to each other, and the two green weights will both change value but end up the same as each other. So after each update, the parameters corresponding to the inputs going into each of the two hidden units are identical: the two green weights are still the same, the two red weights are still the same, and the two blue weights are still the same. And what that means is that even after one iteration of, say, gradient descent, your two hidden units are still computing exactly the same function of the input; you still have a₁⁽²⁾ = a₂⁽²⁾, and you're back to this case. And as you keep running gradient descent, the blue, red and green weights each stay equal to each other. What this means is that your neural network really can't compute very interesting functions. Imagine that you had not just two hidden units, but many, many hidden units. Then all of your hidden units would be computing the exact same function of the input, and this is a highly redundant representation: your final logistic regression unit really only gets to see one feature, because all of them are the same, and this prevents your neural network from learning anything interesting. In order to get around this problem, the way we initialize the parameters of a neural network is therefore with random initialization.
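The symmetry argument above can be checked numerically. The following NumPy sketch (a toy 2-input, 2-hidden-unit, 1-output network with made-up input and weight values, not code from the course) initializes every weight to the same constant and verifies that both hidden units compute the same activation and receive identical gradients:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, 0.5])            # one training example (hypothetical values)
y = 1.0                             # its label
Theta1 = np.full((2, 3), 0.1)       # hidden-layer weights, all identical
Theta2 = np.full((1, 3), 0.1)       # output-layer weights, all identical

# Forward pass
a1 = np.concatenate(([1.0], x))     # input with bias unit
a2 = sigmoid(Theta1 @ a1)           # both hidden activations come out equal
h = sigmoid(Theta2 @ np.concatenate(([1.0], a2)))[0]

# Backward pass for the logistic cost
delta3 = h - y
delta2 = (Theta2[:, 1:].T.flatten() * delta3) * a2 * (1 - a2)
grad_Theta1 = np.outer(delta2, a1)  # gradient for the hidden-layer weights

print(a2[0] == a2[1])                               # True
print(np.allclose(grad_Theta1[0], grad_Theta1[1]))  # True
```

Because the two rows of grad_Theta1 are equal, a gradient step preserves the symmetry, so the two hidden units remain copies of each other after every update.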


Concretely, the problem we saw on the previous slide is sometimes called the problem of symmetric weights, that is, the weights all being the same. And so this random initialization is how we perform symmetry breaking. What we do is initialize each value of Θᵢⱼ⁽ˡ⁾ to a random number in [−ε, ε], which is notation meaning a number between −ε and ε. So my parameters are all going to be randomly initialized between −ε and ε. The way I write code to do this in Octave is Theta1 = rand(10,11)*(2*INIT_EPSILON) - INIT_EPSILON;. Here rand(10,11) computes a random 10×11 dimensional matrix, and all the values are between 0 and 1; these are real numbers that take on any continuous value between 0 and 1. And so, if you take a number between 0 and 1, multiply it by 2·INIT_EPSILON, and subtract INIT_EPSILON, you end up with a number between −INIT_EPSILON and INIT_EPSILON. And incidentally, this ε here has nothing to do with the ε we were using when we were doing gradient checking: for numerical gradient checking, we were adding small values of ε to Θ.
This is an unrelated value of ε, which is why I'm denoting it INIT_EPSILON, just to distinguish it from the ε we were using in gradient checking. Similarly, if you want to initialize Theta2 to a random 1×11 matrix, you can do so with Theta2 = rand(1,11)*(2*INIT_EPSILON) - INIT_EPSILON;. So to summarize: to train a neural network, what you should do is randomly initialize the weights to small values close to zero, between −ε and ε; then implement back propagation, do gradient checking, and use either gradient descent or one of the advanced optimization algorithms to try to minimize J(Θ) as a function of the parameters Θ, starting from this randomly chosen initial value for the parameters. By doing symmetry breaking in this way, hopefully gradient descent or the advanced optimization algorithms will be able to find a good value of Θ.
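For reference, the random-initialization recipe described above (given in Octave in the lecture) can be sketched in NumPy as follows; the value chosen for INIT_EPSILON below is a made-up small constant for illustration:

```python
import numpy as np

INIT_EPSILON = 0.12   # hypothetical choice; any small positive epsilon works

# rand gives uniform values in [0, 1); scaling by 2*eps and shifting by -eps
# maps them into [-eps, eps), mirroring the Octave recipe.
Theta1 = np.random.rand(10, 11) * (2 * INIT_EPSILON) - INIT_EPSILON
Theta2 = np.random.rand(1, 11) * (2 * INIT_EPSILON) - INIT_EPSILON

print(Theta1.shape)   # (10, 11)
```

Because every entry is drawn independently, no two hidden units start with identical incoming weights, which is exactly the symmetry breaking the lecture describes.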

<end>
