This series adds personal study notes and supplementary derivations on top of the original course; corrections and feedback are welcome. After working through Andrew Ng's course, I organized it into text for easier review and reference. Since I have been studying English, the series is primarily in English, and I suggest readers also rely mainly on the English, using the Chinese as support, as preparation for later reading academic papers in related fields. - ZJ
Please credit the author and source when reposting: ZJ, WeChat official account "SelfImprovementLab"
知乎:https://zhuanlan.zhihu.com/c_147249273
CSDN:http://blog.csdn.net/junjun_zhao/article/details/79025300
4.3 Forward propagation in a deep neural network
(Subtitle source: NetEase Cloud Classroom)
In the last video we discussed what a deep multi-layer neural network is, and also talked about the notation we use to describe such networks. In this video you'll see how to perform forward propagation in a deep network. As usual, let's first go over what forward propagation looks like for a single training example x, and later on we'll talk about the vectorized version, where you carry out forward propagation on the entire training set at the same time. Given a single training example x, here's how you compute the activations of the first layer. For this first layer you compute:
Key point: writing the input $x$ as $a^{[0]}$, the first layer computes

$$z^{[1]} = W^{[1]} a^{[0]} + b^{[1]}, \qquad a^{[1]} = g^{[1]}\left(z^{[1]}\right)$$
If you do that, you've now computed the activations of layer 1. How about layer 2? For that layer, you would then compute:
Key point: for layer 2,

$$z^{[2]} = W^{[2]} a^{[1]} + b^{[2]}, \qquad a^{[2]} = g^{[2]}\left(z^{[2]}\right)$$

General rule: for any layer $l$,

$$z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]}, \qquad a^{[l]} = g^{[l]}\left(z^{[l]}\right)$$
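The layer-by-layer recurrence for a single example can be sketched in a few lines of NumPy. This is a minimal illustration, not the course's assignment code; the layer sizes and function names here are hypothetical.

```python
import numpy as np

def layer_forward(a_prev, W, b, g):
    """One forward-propagation step for a single example.

    a_prev : activations of the previous layer, shape (n_prev, 1)
    W      : this layer's weight matrix, shape (n, n_prev)
    b      : this layer's bias vector, shape (n, 1)
    g      : this layer's activation function
    """
    z = W @ a_prev + b   # z[l] = W[l] a[l-1] + b[l]
    a = g(z)             # a[l] = g[l](z[l])
    return a

# Hypothetical sizes: 3 input features, 4 units in layer 1
rng = np.random.default_rng(0)
x = rng.standard_normal((3, 1))      # a[0] = x
W1 = rng.standard_normal((4, 3))
b1 = np.zeros((4, 1))
relu = lambda z: np.maximum(0, z)

a1 = layer_forward(x, W1, b1, relu)
print(a1.shape)   # (4, 1)
```

Repeating `layer_forward` with each layer's parameters and activation function carries the example through the whole network.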
So we've done all this for a single training example. How about doing it in a vectorized way for the whole training set at the same time? The equations look quite similar to before; for the first layer you would have

$$Z^{[1]} = W^{[1]} X + b^{[1]}, \qquad A^{[1]} = g^{[1]}\left(Z^{[1]}\right)$$

where $X$ stacks the training examples as columns, so $X = A^{[0]}$.
So just to summarize our notation: it allows us to replace lowercase z and a with their uppercase counterparts, and that gives you the vectorized version of forward propagation, which you carry out on the entire training set at a time, where A[0] is X. Now if you look at this implementation of vectorization, it looks like there is going to be a for loop here, right? For l equals 1 through capital L, you compute the activations for layer 1, then layer 2, then layer 3, then layer 4. So it seems there is a for loop here, and I know that when implementing neural networks we usually want to get rid of explicit for loops, but this is one place where I don't think there is any way to implement it other than with an explicit for loop. So when implementing forward propagation, it is perfectly OK to have a for loop that computes the activations for layer 1, then layer 2, then layer 3, then layer 4. I don't think there is any way to do this without a for loop that goes from 1 to capital L, the total number of layers in the neural network, so this is a place where an explicit for loop is perfectly fine.
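The vectorized pass with its explicit for loop over layers can be sketched as follows. The helper name `forward_propagation` and the network sizes are illustrative assumptions, not the lecture's code.

```python
import numpy as np

def forward_propagation(X, params, activations):
    """Vectorized forward pass over the whole training set.

    X           : input matrix, shape (n_x, m), one column per example
    params      : list of (W, b) pairs for layers 1..L
    activations : list of activation functions g[1]..g[L]
    """
    A = X                              # A[0] = X
    # An explicit for loop over layers 1..L is fine here.
    for (W, b), g in zip(params, activations):
        Z = W @ A + b                  # Z[l] = W[l] A[l-1] + b[l]
        A = g(Z)                       # A[l] = g[l](Z[l])
    return A

# Hypothetical 2-layer network, 3 -> 4 -> 1, on m = 5 examples
rng = np.random.default_rng(1)
X = rng.standard_normal((3, 5))
relu = lambda z: np.maximum(0, z)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
params = [
    (rng.standard_normal((4, 3)), np.zeros((4, 1))),
    (rng.standard_normal((1, 4)), np.zeros((1, 1))),
]

AL = forward_propagation(X, params, [relu, sigmoid])
print(AL.shape)   # (1, 5): one prediction per example
```

Note that the bias vectors of shape (n, 1) broadcast across the m columns, which is what lets one set of parameters serve every example at once.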
So that's it for the notation for deep neural networks, as well as how to do forward propagation in these networks. If the pieces we've seen so far look a little bit familiar, that's because what we've done is take a piece very similar to the neural network with a single hidden layer and just repeat it more times. It turns out that when implementing a deep neural network, one way to increase your odds of a bug-free implementation is to think very systematically and carefully about the matrix dimensions you're working with. When I'm trying to debug my own code, I'll often pull out a piece of paper and think carefully through the dimensions of the matrices I'm working with. You'll see how to do that in the next video.
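The dimension bookkeeping mentioned above can be checked mechanically: for a layer $l$ with $n^{[l]}$ units, $W^{[l]}$ is $(n^{[l]}, n^{[l-1]})$, $b^{[l]}$ is $(n^{[l]}, 1)$, and $Z^{[l]}$, $A^{[l]}$ come out as $(n^{[l]}, m)$. A small sketch, with hypothetical layer sizes:

```python
# Hypothetical 3-layer network: n[0]=3 inputs, then 4, 4, 1 units
layer_dims = [3, 4, 4, 1]
m = 7                     # number of training examples

shapes = {}
for l in range(1, len(layer_dims)):
    shapes[l] = {
        "W": (layer_dims[l], layer_dims[l - 1]),  # (n[l], n[l-1])
        "b": (layer_dims[l], 1),                  # (n[l], 1)
        "Z": (layer_dims[l], m),                  # (n[l], m)
    }
    print(f"layer {l}: W{shapes[l]['W']}, b{shapes[l]['b']}, Z{shapes[l]['Z']}")
```

Comparing a table like this against the actual arrays in your code is one concrete way to do the paper-and-pencil dimension check described above.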
Key takeaways:

General rule (single example, $l = 1, \dots, L$, with $a^{[0]} = x$):

$$z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]}, \qquad a^{[l]} = g^{[l]}\left(z^{[l]}\right)$$

Vectorized over the whole training set (with $A^{[0]} = X$):

$$Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]}, \qquad A^{[l]} = g^{[l]}\left(Z^{[l]}\right)$$