Why does ReLU work?
Vector Derivation
But the Jacobian is sparse: off-diagonal entries are all zero! Never explicitly form the Jacobian.
It turns out that dy/dx_{1,1} = [3, 2, 1, -1], which is exactly the first row of w, so we never need to solve for the Jacobian explicitly.
Note that the expression above is not the true Jacobian.
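The claim above can be checked numerically. A minimal sketch, assuming y = x @ w with a 2×3 input and a 3×4 weight matrix whose first row matches the notes' example [3, 2, 1, -1] (the other entries are arbitrary illustrative values):

```python
import numpy as np

# For y = x @ w, the derivative of y's first row with respect to x[0, 0]
# is exactly the first row of w, so the full Jacobian never needs to be
# formed explicitly. Verified here by central finite differences.
x = np.random.randn(2, 3)
w = np.array([[3., 2., 1., -1.],
              [0., 1., 2., 3.],
              [1., 0., 1., 0.]])

eps = 1e-6
x_plus, x_minus = x.copy(), x.copy()
x_plus[0, 0] += eps
x_minus[0, 0] -= eps
dy_dx00 = ((x_plus @ w - x_minus @ w) / (2 * eps))[0]

print(np.allclose(dy_dx00, w[0]))  # True: it is the first row of w
```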
Convolutional Layers
The most basic convolutional layer:
The number of channels in the image must match the number of channels in each filter. The output can also be viewed as a 28×28 grid, where each cell holds a 6-dim vector representing some structural information at that spatial position.
N is the batch size.
Receptive Fields
For a convolution with kernel size K, each element in the output depends on a K × K receptive field in the input. Each successive convolution adds K − 1 to the receptive field size. With L layers the receptive field size is 1 + L(K − 1).
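The growth formula above can be sketched directly (a minimal illustration assuming stride 1 and the same kernel size K at every layer; the function name is ours):

```python
# Receptive field growth for stacked stride-1 convolutions,
# following the formula 1 + L*(K-1) from the notes.
def receptive_field(num_layers, kernel_size):
    rf = 1
    for _ in range(num_layers):
        rf += kernel_size - 1  # each conv layer adds K-1
    return rf

print(receptive_field(3, 3))  # three 3x3 convs -> 7
print(receptive_field(1, 5))  # one 5x5 conv -> 5
```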
Strided Convolution
Input W, filter K, padding P, stride S → Output: (W − K + 2P)/S + 1
Example:
Input volume: 3 × 32 × 32, filters: 10 × 3 × 5 × 5, stride: 1, pad: 2
Output volume: (32 − 5 + 2·2)/1 + 1 = 32, so 10 × 32 × 32
Learnable parameters: per filter 3 · 5 · 5 + 1 = 76; 10 filters, so the total is 760
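The worked example can be verified with two small helpers (illustrative names, following the output-size and parameter-count formulas in the notes; the +1 is the per-filter bias):

```python
# Output spatial size for a conv layer: (W - K + 2P)/S + 1
def conv_output_size(W, K, P, S):
    return (W - K + 2 * P) // S + 1

# Total learnable parameters: each filter has C*K*K weights plus one bias
def conv_params(num_filters, in_channels, K):
    return num_filters * (in_channels * K * K + 1)

print(conv_output_size(32, 5, 2, 1))  # -> 32
print(conv_params(10, 3, 5))          # -> 760
```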
1×1 Convolution:
Commonly used when we would like to change the channel structure of the input.
So a 1×1 convolution can fundamentally be understood as applying the same fully connected layer at each of the 36 spatial positions: the layer takes the 32 input numbers (channels) and outputs #filters numbers; repeating this over all 36 cells gives a 6×6×#filters output, implementing a non-trivial computation on the input volume.
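This equivalence is easy to demonstrate in NumPy. A minimal sketch using the shapes from the example above (6×6 spatial grid, 32 input channels; the filter count F = 16 is an arbitrary illustrative choice):

```python
import numpy as np

# A 1x1 convolution is a fully connected layer applied independently at
# every spatial position: one matrix multiply over the channel dimension.
H, W, C_in, F = 6, 6, 32, 16
x = np.random.randn(H, W, C_in)
w = np.random.randn(C_in, F)   # one weight column per output filter
b = np.random.randn(F)

out = x.reshape(-1, C_in) @ w + b  # (36, F): same FC layer at 36 positions
out = out.reshape(H, W, F)         # back to the 6x6x#filters grid
print(out.shape)  # (6, 6, 16)
```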
Commonly:
- Spatial size decreases.
- Number of channels increases.
Pooling
During the forward pass, a pooling layer compresses its input, so the backward pass needs an up-sampling step.
Backward pass of max-pooling
The forward pass of max-pooling passes the largest value in each patch to the next layer and simply discards the other pixels. Accordingly, the backward pass routes the gradient straight back to that single pixel in the previous layer, while all the other pixels receive no gradient, i.e. zero. Max-pooling therefore needs to record which pixel in each patch held the maximum during pooling, i.e. the max id.
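The routing described above can be sketched for 2×2 max-pooling with stride 2 on a single channel (a minimal illustration, not an optimized implementation; function names are ours):

```python
import numpy as np

# Forward: keep each patch's max and remember its position (the "max id").
def maxpool_forward(x):
    H, W = x.shape
    out = np.zeros((H // 2, W // 2))
    max_ids = {}
    for i in range(0, H, 2):
        for j in range(0, W, 2):
            patch = x[i:i+2, j:j+2]
            r, c = np.unravel_index(np.argmax(patch), patch.shape)
            out[i // 2, j // 2] = patch[r, c]
            max_ids[(i // 2, j // 2)] = (i + r, j + c)
    return out, max_ids

# Backward: route each gradient to the recorded max position; every other
# pixel receives zero gradient.
def maxpool_backward(dout, max_ids, input_shape):
    dx = np.zeros(input_shape)
    for (oi, oj), (r, c) in max_ids.items():
        dx[r, c] = dout[oi, oj]
    return dx

x = np.array([[1., 3.],
              [2., 0.]])
out, ids = maxpool_forward(x)
dx = maxpool_backward(np.ones((1, 1)), ids, x.shape)
print(out)  # [[3.]]
print(dx)   # [[0. 1.] [0. 0.]] -- gradient only at the max's position
```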
Mean-pooling
Mean-pooling averages each patch in the forward pass; in the backward pass the gradient is distributed evenly over the patch's pixels.
Batch Normalization
Helps reduce “internal covariate shift”, improves optimization.
This is a differentiable function, so we can use it as an operator in our networks and backprop through it.
Batch Size N, vector length D
What if zero mean, unit variance is too hard a constraint?
→ After this normalization, we add learnable parameters (a scale and a shift) so the network can reconstruct the distribution.
But the estimates depend on the minibatch, so we can't do this at test time → we need a different operation at test time.
At test time, we save the (running) average of the values seen during training as fixed constants instead of computing batch statistics. BN then becomes a linear operator, which can be fused with the preceding FC or Conv layer to reduce compute. BN is usually inserted after an FC or Conv layer and before the nonlinearity.
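The train/test split above can be sketched for a (N, D) batch. A minimal illustration: `gamma`/`beta` are the learnable scale and shift, and the `momentum` and `eps` values are typical choices, not prescribed by the notes:

```python
import numpy as np

# Batch norm: batch statistics during training (while updating running
# averages), fixed stored averages at test time.
def batchnorm(x, gamma, beta, running_mean, running_var,
              train=True, momentum=0.9, eps=1e-5):
    if train:
        mu = x.mean(axis=0)
        var = x.var(axis=0)
        running_mean = momentum * running_mean + (1 - momentum) * mu
        running_var = momentum * running_var + (1 - momentum) * var
    else:
        mu, var = running_mean, running_var  # fixed constants at test time
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta, running_mean, running_var

N, D = 8, 4
x = np.random.randn(N, D) * 3 + 5
out, rm, rv = batchnorm(x, np.ones(D), np.zeros(D),
                        np.zeros(D), np.ones(D), train=True)
print(out.mean(axis=0).round(6))  # ~0 per feature after normalization
```

At test time one would call `batchnorm(..., train=False)` with the accumulated `rm`/`rv`, making the op a fixed affine transform that can be folded into the preceding layer.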
Pros:
- Makes deep networks much easier to train.
- Allows higher learning rates, faster convergence
- Networks become more robust to initialization
- Acts as regularization during training
- Zero overhead at test-time
Cons:
- Not well-understood
- Behaves differently during training and testing. First, the code needs a train/test flag; second, if the data is very unbalanced, test-time performance may suffer.
Layer Normalization is commonly used in recurrent networks and Transformers.
For images: