Why does ReLU work?

II. Basic Neural Network Structures (Deep Learning, EECS498/CS231n)

Vector Derivatives


But the Jacobian is sparse: off-diagonal entries are all zero! Never explicitly form the Jacobian.


It turns out that $\partial y / \partial x_{1,1} = [3, 2, 1, -1]$, which is exactly the first row of $w$, so there is no need to form the Jacobian explicitly.

$$\frac{\partial L}{\partial x_{i,j}} = \frac{\partial y}{\partial x_{i,j}} \cdot \frac{\partial L}{\partial y} = w_{j,:} \cdot \frac{\partial L}{\partial y_{i,:}}$$

$$\frac{\partial L}{\partial x} = \frac{\partial L}{\partial y}\, w^{\top} \qquad [N \times D] = [N \times M]\,[M \times D]$$

$$\frac{\partial L}{\partial w} = x^{\top}\, \frac{\partial L}{\partial y} \qquad [D \times M] = [D \times N]\,[N \times M]$$

Note: the expressions above are not the true Jacobians; they are compact matrix forms of the gradients.
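The two matrix gradient formulas above can be checked with a small NumPy sketch (shapes and variable names are my own):

```python
import numpy as np

N, D, M = 4, 3, 5                      # batch size, input dim, output dim
rng = np.random.default_rng(0)
x = rng.standard_normal((N, D))
w = rng.standard_normal((D, M))

y = x @ w                              # forward: [N, D] @ [D, M] -> [N, M]
dL_dy = rng.standard_normal((N, M))    # upstream gradient

# Backward: never form the [N, M] x [N, D] Jacobian explicitly.
dL_dx = dL_dy @ w.T                    # [N, M] @ [M, D] -> [N, D]
dL_dw = x.T @ dL_dy                    # [D, N] @ [N, M] -> [D, M]

assert dL_dx.shape == x.shape and dL_dw.shape == w.shape
```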

Convolutional Layer

The most basic convolutional layer:


Here, the number of channels in the image must match the number of channels in each filter. The output can also be viewed as a 28×28 grid, where each cell holds a 6-dim vector representing some kind of structural information at that position.

N is the batch size.

Receptive Fields

For convolution with kernel size K, each element in the output depends on a K × K receptive field in the input. Each successive convolution adds K − 1 to the receptive field size, so with L layers the receptive field size is 1 + L·(K − 1).
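As a quick sanity check of the 1 + L·(K − 1) formula, here is a minimal sketch (stride 1 assumed; the function name is mine):

```python
def receptive_field(L, K):
    """Receptive field after L stacked convolutions of kernel size K, stride 1."""
    rf = 1
    for _ in range(L):
        rf += K - 1        # each conv adds K - 1
    return rf

print(receptive_field(3, 3))  # three stacked 3x3 convs -> 7
```

This recovers the familiar fact that three stacked 3×3 convolutions see a 7×7 input region.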

Stride Convolution

Input size W, filter size K, padding P, stride S -> output size: (W − K + 2P)/S + 1

Example:

Input volume: 3 × 32 × 32. Filters: 10 × 3 × 5 × 5. Stride: 1. Pad: 2.

Output volume: (32 − 5 + 2·2)/1 + 1 = 32, so the output is 10 × 32 × 32.

Learnable parameters: each filter has 3 · 5 · 5 + 1 = 76 parameters; with 10 filters, 760 in total.
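The arithmetic in this example can be reproduced with two small helpers (function names are mine):

```python
def conv_output_size(W, K, P, S):
    """Spatial output size: (W - K + 2P) / S + 1."""
    return (W - K + 2 * P) // S + 1

def conv_params(C_in, C_out, K):
    """Learnable parameters: C_out filters of C_in*K*K weights plus one bias each."""
    return C_out * (C_in * K * K + 1)

# Input 3x32x32, ten 3x5x5 filters, stride 1, pad 2:
print(conv_output_size(32, 5, 2, 1))  # 32
print(conv_params(3, 10, 5))          # 760
```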

1×1 Convolution:

Commonly used when we would like to change the channel structure of the input.

So a 1×1 convolution can fundamentally be understood as applying the same fully connected layer at each of the 36 spatial positions of a 6×6×32 input: the layer takes the 32 channel values as input and outputs #filters values, so the result is 6×6×#filters. This applies a non-trivial computation across channels at every position.
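This FC-at-every-position view can be verified numerically (a sketch using the 6×6×32 shapes from above; 8 filters chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal((6, 6, 32))   # H x W x C_in input
w = rng.standard_normal((32, 8))      # 1x1 conv with 8 filters == FC map 32 -> 8
b = rng.standard_normal(8)

# 1x1 convolution: apply the same linear map at every spatial position.
y_conv = np.einsum('hwc,cf->hwf', x, w) + b

# Equivalent: flatten the 36 positions and apply one fully connected layer.
y_fc = (x.reshape(-1, 32) @ w + b).reshape(6, 6, 8)

assert np.allclose(y_conv, y_fc)
```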

Commonly:

The spatial size decreases.

The number of channels increases.

Pooling

Because the pooling layer compresses the input during the forward pass, the backward pass requires an up-sampling step.

Backward pass of max-pooling

The forward pass of max-pooling passes the largest value in each patch to the next layer, and the other pixels are simply discarded. Accordingly, the backward pass routes the gradient to that single pixel of the previous layer, while all other pixels receive zero gradient. Max-pooling therefore needs to record which pixel in each patch attained the maximum, i.e. the max id.
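The record-the-max-id idea can be sketched for 2×2 max-pooling with stride 2 (a minimal single-channel NumPy version; function names are mine):

```python
import numpy as np

def maxpool2x2_forward(x):
    """2x2 max-pooling, stride 2, on an [H, W] input (H, W even)."""
    H, W = x.shape
    patches = (x.reshape(H // 2, 2, W // 2, 2)
                .transpose(0, 2, 1, 3)
                .reshape(H // 2, W // 2, 4))
    max_id = patches.argmax(axis=-1)   # remember which pixel won each patch
    return patches.max(axis=-1), max_id

def maxpool2x2_backward(d_out, max_id):
    """Route each upstream gradient to the winning pixel; all others get 0."""
    h, w = d_out.shape
    d_patches = np.zeros((h, w, 4))
    i, j = np.indices((h, w))
    d_patches[i, j, max_id] = d_out
    return (d_patches.reshape(h, w, 2, 2)
                     .transpose(0, 2, 1, 3)
                     .reshape(2 * h, 2 * w))

x = np.array([[1., 3.], [2., 0.]])
out, max_id = maxpool2x2_forward(x)
dx = maxpool2x2_backward(np.array([[5.]]), max_id)
print(out)  # [[3.]]
print(dx)   # the gradient 5 lands only where the max (3) was
```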


Mean-pooling


Batch Normalization

Helps reduce “internal covariate shift”, improves optimization.

$$\hat{x}^{(k)} = \frac{x^{(k)} - \mathrm{E}\left[x^{(k)}\right]}{\sqrt{\operatorname{Var}\left[x^{(k)}\right]}}$$

This is a differentiable function, so we can use it as an operator in our networks and backprop through it.


Batch size N, vector length D.

What if zero mean, unit variance is too hard a constraint?

-> After this normalization, we add learnable scale (γ) and shift (β) parameters to the network to reconstruct the distribution.


But the estimates depend on the minibatch, so we can't do this at test time -> BN performs a different operation at test time.

At test time, we save the running average of the values seen during training as constants instead of computing batch statistics. BN then becomes a linear operator, which can be fused with the preceding FC or Conv layer to reduce compute cost. BN is usually inserted after an FC or Conv layer and before the nonlinearity.
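The train/test asymmetry can be sketched as follows (a minimal NumPy version; the momentum value and names are my own):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, running_mean, running_var,
                      train=True, momentum=0.9, eps=1e-5):
    """x: [N, D]. Train time: use batch statistics and update running averages.
    Test time: use the saved running averages, a fixed linear map."""
    if train:
        mu = x.mean(axis=0)
        var = x.var(axis=0)
        running_mean[:] = momentum * running_mean + (1 - momentum) * mu
        running_var[:] = momentum * running_var + (1 - momentum) * var
    else:
        mu, var = running_mean, running_var
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta       # learnable scale/shift restore flexibility

N, D = 8, 4
rng = np.random.default_rng(2)
x = rng.standard_normal((N, D)) * 3 + 5
gamma, beta = np.ones(D), np.zeros(D)
rm, rv = np.zeros(D), np.ones(D)

y = batchnorm_forward(x, gamma, beta, rm, rv, train=True)
assert np.allclose(y.mean(axis=0), 0, atol=1e-6)  # zero mean per feature
```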

Advantages:

  1. Makes deep networks much easier to train.
  2. Allows higher learning rates and faster convergence.
  3. Makes the network more robust to initialization.
  4. Acts as regularization during training.
  5. Zero overhead at test time.

Disadvantages:

  1. Not well understood.
  2. Behaves differently during training and testing. First, the code needs a train/test mode switch; second, if the data are highly imbalanced, test-time performance may suffer.


Layer Normalization is commonly used in recurrent networks and Transformers.

For images:

