Why does ReLU work?

II. Basic Neural Network Structures (Deep Learning, EECS498/CS231n)

Vector Derivatives


But the Jacobian is sparse: off-diagonal entries are all zero! Never explicitly form the Jacobian.


It turns out that $\partial y / \partial x_{1,1} = [3, 2, 1, -1]$, which is exactly the first row of $w$, so there is no need to form the Jacobian explicitly.

$$\frac{\partial L}{\partial x_{i,j}} = \frac{\partial y}{\partial x_{i,j}} \cdot \frac{\partial L}{\partial y} = w_{j,:} \cdot \frac{\partial L}{\partial y_{i,:}}$$

$$\frac{\partial L}{\partial x} = \frac{\partial L}{\partial y}\, w^{\top} \qquad [N \times D] = [N \times M]\,[M \times D]$$

$$\frac{\partial L}{\partial w} = x^{\top}\, \frac{\partial L}{\partial y} \qquad [D \times M] = [D \times N]\,[N \times M]$$

Note: the expressions above are not the true Jacobians; they are compact matrix forms of the gradients.
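The two matrix gradient formulas above can be checked with a small NumPy sketch (shapes and variable names are my own):

```python
import numpy as np

N, D, M = 4, 3, 5                      # batch size, input dim, output dim
rng = np.random.default_rng(0)
x = rng.standard_normal((N, D))
w = rng.standard_normal((D, M))

y = x @ w                              # forward: [N, D] @ [D, M] -> [N, M]
dL_dy = rng.standard_normal((N, M))    # upstream gradient

# Backward: never form the [N, M] x [N, D] Jacobian explicitly.
dL_dx = dL_dy @ w.T                    # [N, M] @ [M, D] -> [N, D]
dL_dw = x.T @ dL_dy                    # [D, N] @ [N, M] -> [D, M]

assert dL_dx.shape == x.shape and dL_dw.shape == w.shape
```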

Convolutional Layer

The most basic convolutional layer:


Here, the number of channels in the image must match the number of channels in each filter. The output can also be viewed as a 28×28 grid, where each cell holds a 6-dim vector representing some kind of structural information at that position.

N is the batch size.

Receptive Fields

For convolution with kernel size K, each element in the output depends on a K × K receptive field in the input. Each successive convolution adds K − 1 to the receptive field size, so with L layers the receptive field size is 1 + L·(K − 1).
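As a quick sanity check of the 1 + L·(K − 1) formula, here is a minimal sketch (stride 1 assumed; the function name is mine):

```python
def receptive_field(L, K):
    """Receptive field after L stacked convolutions of kernel size K, stride 1."""
    rf = 1
    for _ in range(L):
        rf += K - 1        # each conv adds K - 1
    return rf

print(receptive_field(3, 3))  # three stacked 3x3 convs -> 7
```

This recovers the familiar fact that three stacked 3×3 convolutions see a 7×7 input region.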

Stride Convolution

Input size W, filter size K, padding P, stride S -> output size: (W − K + 2P)/S + 1

Example:

Input volume: 3 × 32 × 32. Filters: 10 × 3 × 5 × 5. Stride: 1. Pad: 2.

Output volume: (32 − 5 + 2·2)/1 + 1 = 32, so the output is 10 × 32 × 32.

Learnable parameters: each filter has 3 · 5 · 5 + 1 = 76 parameters; with 10 filters, 760 in total.
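The arithmetic in this example can be reproduced with two small helpers (function names are mine):

```python
def conv_output_size(W, K, P, S):
    """Spatial output size: (W - K + 2P) / S + 1."""
    return (W - K + 2 * P) // S + 1

def conv_params(C_in, C_out, K):
    """Learnable parameters: C_out filters of C_in*K*K weights plus one bias each."""
    return C_out * (C_in * K * K + 1)

# Input 3x32x32, ten 3x5x5 filters, stride 1, pad 2:
print(conv_output_size(32, 5, 2, 1))  # 32
print(conv_params(3, 10, 5))          # 760
```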

1×1 Convolution:

Commonly used when we would like to change the channel structure of the input.

So a 1×1 convolution can fundamentally be understood as applying the same fully connected layer at each of the 36 spatial positions of a 6×6×32 input: the layer takes the 32 channel values as input and outputs #filters values, so the result is 6×6×#filters. This applies a non-trivial computation across channels at every position.
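This FC-at-every-position view can be verified numerically (a sketch using the 6×6×32 shapes from above; 8 filters chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal((6, 6, 32))   # H x W x C_in input
w = rng.standard_normal((32, 8))      # 1x1 conv with 8 filters == FC map 32 -> 8
b = rng.standard_normal(8)

# 1x1 convolution: apply the same linear map at every spatial position.
y_conv = np.einsum('hwc,cf->hwf', x, w) + b

# Equivalent: flatten the 36 positions and apply one fully connected layer.
y_fc = (x.reshape(-1, 32) @ w + b).reshape(6, 6, 8)

assert np.allclose(y_conv, y_fc)
```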

Commonly:

The spatial size decreases.

The number of channels increases.

Pooling

Because the pooling layer compresses the input during the forward pass, the backward pass requires an up-sampling step.

Backward pass of max-pooling

The forward pass of max-pooling passes the largest value in each patch to the next layer, and the other pixels are simply discarded. Accordingly, the backward pass routes the gradient to that single pixel of the previous layer, while all other pixels receive zero gradient. Max-pooling therefore needs to record which pixel in each patch attained the maximum, i.e. the max id.
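The record-the-max-id idea can be sketched for 2×2 max-pooling with stride 2 (a minimal single-channel NumPy version; function names are mine):

```python
import numpy as np

def maxpool2x2_forward(x):
    """2x2 max-pooling, stride 2, on an [H, W] input (H, W even)."""
    H, W = x.shape
    patches = (x.reshape(H // 2, 2, W // 2, 2)
                .transpose(0, 2, 1, 3)
                .reshape(H // 2, W // 2, 4))
    max_id = patches.argmax(axis=-1)   # remember which pixel won each patch
    return patches.max(axis=-1), max_id

def maxpool2x2_backward(d_out, max_id):
    """Route each upstream gradient to the winning pixel; all others get 0."""
    h, w = d_out.shape
    d_patches = np.zeros((h, w, 4))
    i, j = np.indices((h, w))
    d_patches[i, j, max_id] = d_out
    return (d_patches.reshape(h, w, 2, 2)
                     .transpose(0, 2, 1, 3)
                     .reshape(2 * h, 2 * w))

x = np.array([[1., 3.], [2., 0.]])
out, max_id = maxpool2x2_forward(x)
dx = maxpool2x2_backward(np.array([[5.]]), max_id)
print(out)  # [[3.]]
print(dx)   # the gradient 5 lands only where the max (3) was
```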


Mean-pooling


Batch Normalization

Helps reduce “internal covariate shift”, improves optimization.

$$\hat{x}^{(k)} = \frac{x^{(k)} - \mathrm{E}\left[x^{(k)}\right]}{\sqrt{\operatorname{Var}\left[x^{(k)}\right]}}$$

This is a differentiable function, so we can use it as an operator in our networks and backprop through it.


Batch size N, vector length D.

What if zero mean, unit variance is too hard a constraint?

-> After this normalization, we add learnable scale (γ) and shift (β) parameters to the network to reconstruct the distribution.


But the estimates depend on the minibatch, so we can't do this at test time -> BN performs a different operation at test time.

At test time, we save the running average of the values seen during training as constants instead of computing batch statistics. BN then becomes a linear operator, which can be fused with the preceding FC or Conv layer to reduce compute cost. BN is usually inserted after an FC or Conv layer and before the nonlinearity.
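The train/test asymmetry can be sketched as follows (a minimal NumPy version; the momentum value and names are my own):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, running_mean, running_var,
                      train=True, momentum=0.9, eps=1e-5):
    """x: [N, D]. Train time: use batch statistics and update running averages.
    Test time: use the saved running averages, a fixed linear map."""
    if train:
        mu = x.mean(axis=0)
        var = x.var(axis=0)
        running_mean[:] = momentum * running_mean + (1 - momentum) * mu
        running_var[:] = momentum * running_var + (1 - momentum) * var
    else:
        mu, var = running_mean, running_var
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta       # learnable scale/shift restore flexibility

N, D = 8, 4
rng = np.random.default_rng(2)
x = rng.standard_normal((N, D)) * 3 + 5
gamma, beta = np.ones(D), np.zeros(D)
rm, rv = np.zeros(D), np.ones(D)

y = batchnorm_forward(x, gamma, beta, rm, rv, train=True)
assert np.allclose(y.mean(axis=0), 0, atol=1e-6)  # zero mean per feature
```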

Advantages:

  1. Makes deep networks much easier to train.
  2. Allows higher learning rates and faster convergence.
  3. Makes the network more robust to initialization.
  4. Acts as regularization during training.
  5. Zero overhead at test time.

Disadvantages:

  1. Not well understood.
  2. Behaves differently during training and testing. First, the code needs a train/test mode switch; second, if the data are highly imbalanced, test-time performance may suffer.


Layer Normalization is commonly used in recurrent networks and Transformers.

For images:

