【SENet】《Squeeze-and-Excitation Networks》

CVPR-2018

Caffe code: https://github.com/hujie-frank/SENet
Caffe model visualization tool: http://ethereon.github.io/netscope/#/editor

Table of contents

1 Background and Motivation
2 Advantages
3 Related work
4 Method

4.1 Squeeze: Global Information Embedding
4.2 Excitation: Adaptive Recalibration
4.3 Exemplars: SEInception and SEResNet
4.4 Model and Computational Complexity

5 Experiments

5.1 ImageNet Classification
5.2 Scene Classification
5.3 Object Detection on COCO
5.4 Analysis and Interpretation

Summary

1 Background and Motivation

CNNs extract informative features by fusing spatial and channel-wise information together within local receptive fields.

To strengthen the representational power of CNNs:

Several existing methods: enhance spatial encoding (spatial correlations, spatial attention), for example by concatenating the outputs of several receptive-field sizes as in Inception.
The authors: focus on the channel relationship.

The authors therefore design the Squeeze-and-Excitation block, which emphasises informative features and suppresses less useful ones (channel-wise).

2 Advantages

ILSVRC2017 classification first place
reduced the top-5 error to 2.251%

The development of new CNN architectures is a challenging engineering task, typically involving the selection of many new hyperparameters and layer configurations.

3 Related work

Deep architecture
Attention and gating mechanisms

4 Method

The role an SE block performs at different depths adapts to the needs of the network:

In the early layers, it learns to excite informative features in a class-agnostic manner.
In later layers, the SE block becomes increasingly specialised (in a highly class-specific manner).

SE blocks can stand on their own (a network built by stacking SE blocks), or they can be plugged into existing designs as a drop-in replacement for the original block at any depth in the architecture (e.g., the bottleneck block of ResNet or ResNeXt).

$F_{tr}$ is a convolution operator, $F_{tr}: X \to U$, computed as follows:

$u_c = v_c * X = \sum_{s=1}^{C'} v_c^s * x^s$

Here $*$ denotes convolution and $C'$ is the number of input channels.

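The figure below helps in understanding the formula above.

$X$ is the input feature map, $v_c$ (lowercase c, in red in the figure above; the uppercase C is in black) is one filter, and $u_c$ (again lowercase c, in red) is the output channel that this filter produces.

where $V = [v_1,v_2,...,v_C]$ (uppercase C)
and $U = [u_1,u_2,...,u_C]$ (uppercase C)

The superscript denotes a spatial kernel: a single 2D slice of $v_c$ that acts on one channel of $X$. This is the part I drew by splitting $X$ and $v_c$ (lowercase c) apart in my figure, and it makes things much clearer. As for how this familiar formula leads to thinking about the importance of individual channels, and from there to the SE block, I cannot see it yet.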
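To make the per-channel decomposition concrete, here is a minimal pure-Python sketch of how one output channel $u_c$ is built by summing the 2D results of each spatial kernel over the input channels. The helper names (`conv2d_single`, `f_tr_channel`) and the toy values are my own, not from the paper, and the sketch uses "valid" cross-correlation without padding, stride or bias, as deep-learning libraries usually do.

```python
def conv2d_single(x, k):
    """Valid 2D cross-correlation of one channel x (H x W) with one kernel k (kh x kw)."""
    H, W = len(x), len(x[0])
    kh, kw = len(k), len(k[0])
    out = []
    for i in range(H - kh + 1):
        row = []
        for j in range(W - kw + 1):
            row.append(sum(x[i + a][j + b] * k[a][b]
                           for a in range(kh) for b in range(kw)))
        out.append(row)
    return out


def f_tr_channel(X, v_c):
    """u_c = sum over input channels s of (v_c^s applied to x^s)."""
    per_channel = [conv2d_single(x_s, k_s) for x_s, k_s in zip(X, v_c)]
    h, w = len(per_channel[0]), len(per_channel[0][0])
    return [[sum(p[i][j] for p in per_channel) for j in range(w)] for i in range(h)]


# Toy input: C' = 2 channels of size 3 x 3 (made-up values).
X = [
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    [[1, 0, 1], [0, 1, 0], [1, 0, 1]],
]
# One filter v_c made of two 2 x 2 spatial kernels, one per input channel.
v_c = [
    [[1, 0], [0, 1]],  # v_c^1
    [[1, 1], [1, 1]],  # v_c^2
]
u_c = f_tr_channel(X, v_c)
print(u_c)  # [[8, 10], [14, 16]]
```

The point to notice is that the sum over input channels mixes all channels into every $u_c$, so channel dependencies are only modelled implicitly and are entangled with the spatial correlation the kernels capture.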

4.1 Squeeze: Global Information Embedding

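$z_c = F_{sq}(u_c) = \frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W} u_c(i,j)$

$u_c$ is channel $c$ of the feature map $U$. The formula above applies global average pooling to that channel; $z_c$ is a scalar, one cell of the channel descriptor (see the figure below).

Of course, global average pooling is only one way to aggregate global information; more sophisticated aggregation strategies could be employed here as well.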
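The squeeze step can be sketched in a few lines of plain Python. The function name `squeeze` and the toy feature map are my own choices for illustration; the feature map is stored as nested lists in C x H x W order.

```python
def squeeze(U):
    """F_sq: average each channel of U (C x H x W nested lists) down to one scalar z_c."""
    z = []
    for u_c in U:
        H, W = len(u_c), len(u_c[0])
        z.append(sum(sum(row) for row in u_c) / (H * W))
    return z


# Toy feature map with C = 2 channels of size 2 x 2 (made-up values).
U = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 0.0], [0.0, 8.0]],
]
z = squeeze(U)
print(z)  # [2.5, 2.0]
```

The descriptor `z` has one entry per channel regardless of H and W, which is what lets the following fully connected layers see information from the whole spatial extent.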

4.2 Excitation: Adaptive Recalibration

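Two design criteria:

flexible (it must be able to learn a non-linear interaction between channels)
non-mutually-exclusive (several channels may be emphasised at once, as opposed to a one-hot activation)

The authors realize this by employing a simple gating mechanism with a sigmoid activation, more concretely two fully connected (FC) layers. As the figure below shows, the first FC layer reduces the dimension and is followed by ReLU; the second restores the original dimension and is followed by a sigmoid (borrowing the gating idea from LSTM).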

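The formula is:

$s = F_{ex}(z, W) = \sigma(g(z, W)) = \sigma(W_2 \delta(W_1 z))$

where $\delta$ is ReLU, $\sigma$ is sigmoid, $W_1 \in \mathbb{R}^{\frac{C}{r} \times C}$ and $W_2 \in \mathbb{R}^{C \times \frac{C}{r}}$.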

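The scaling formula $\tilde{x}_c = F_{scale}(u_c, s_c) = s_c \cdot u_c$ means that each channel of $U$ is multiplied by the corresponding dimension of $s$, which amounts to weighting the feature map channel by channel. This is the $F_{scale}$ part of the figure below.

The final output is $\tilde{X} = [\tilde{x}_1, \tilde{x}_2, ..., \tilde{x}_C]$.

To summarize:

$F_{sq}$: global average pooling
$F_{ex}$: two fully connected layers
$F_{scale}$: multiply the feature map ( $U$ ) by the channel weights (the output of $F_{ex}$ )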
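The three steps summarized above can be chained into one forward pass. What follows is a minimal sketch in plain Python, not the authors' Caffe implementation: the function names mirror the notation of this post, biases are omitted, and the weights `W1`, `W2` are hand-picked toy values (C = 4, r = 2) rather than learned ones.

```python
import math


def f_sq(U):
    """Squeeze: global average pooling, one scalar z_c per channel."""
    return [sum(sum(row) for row in u_c) / (len(u_c) * len(u_c[0])) for u_c in U]


def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]


def f_ex(z, W1, W2):
    """Excitation: s = sigmoid(W2 . relu(W1 . z)), biases omitted."""
    hidden = [max(0.0, a) for a in matvec(W1, z)]
    return [1.0 / (1.0 + math.exp(-a)) for a in matvec(W2, hidden)]


def f_scale(U, s):
    """Scale: multiply every value of channel c by its weight s_c."""
    return [[[s_c * a for a in row] for row in u_c] for u_c, s_c in zip(U, s)]


def se_block(U, W1, W2):
    """Full SE block: squeeze, excite, then rescale the input feature map."""
    return f_scale(U, f_ex(f_sq(U), W1, W2))


# Toy feature map: C = 4 channels of size 2 x 2 (made-up values).
U = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 0.0], [0.0, 8.0]],
    [[1.0, 1.0], [1.0, 1.0]],
    [[4.0, 4.0], [4.0, 4.0]],
]
# Reduction ratio r = 2: W1 is (C/r) x C = 2 x 4, W2 is C x (C/r) = 4 x 2.
W1 = [[0.5, -0.5, 0.0, 0.25],
      [0.0, 0.5, -1.0, 0.0]]
W2 = [[1.0, 0.0],
      [0.0, 1.0],
      [-1.0, 0.0],
      [0.0, -1.0]]

s = f_ex(f_sq(U), W1, W2)      # one weight in (0, 1) per channel
X_tilde = se_block(U, W1, W2)  # same shape as U, each channel rescaled
```

Because each entry of `s` comes from its own sigmoid, several channels can receive a high weight at the same time, which is the non-mutually-exclusive property asked for in 4.2.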

4.3 Exemplars: SEInception and SEResNet

SE-Inception:
$F_{tr}$ is taken to be an entire Inception block. For the theory and practice of Inception, see the series under 1.1 Classification / Object Detection and 4.1 【Keras】Classification in CIFAR-10 at https://blog.csdn.net/bryant_meng/article/details/78597190

Left: the original Inception module; right: SE-Inception.

SE-ResNet:

$F_{tr}$ is taken to be the non-identity branch of a residual module.
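A rough sketch of the SE-ResNet wiring, assuming the gate is applied to the residual branch before it is added back to the identity path. The names `se_gate` and `se_residual_block` are mine, `residual_branch` is a hypothetical stand-in for the convolution layers of a real block, and the ReLU that ResNet applies after the addition is left out to keep the example short.

```python
import math


def se_gate(U, W1, W2):
    """Squeeze + excitation: returns one weight in (0, 1) per channel of U."""
    z = [sum(map(sum, u_c)) / (len(u_c) * len(u_c[0])) for u_c in U]
    hidden = [max(0.0, sum(w * a for w, a in zip(row, z))) for row in W1]
    return [1.0 / (1.0 + math.exp(-sum(w * a for w, a in zip(row, hidden))))
            for row in W2]


def se_residual_block(X, residual_branch, W1, W2):
    """y = X + s * F_tr(X), with F_tr the non-identity branch."""
    U = residual_branch(X)      # F_tr: non-identity branch
    s = se_gate(U, W1, W2)      # channel weights computed from U
    return [[[x + s_c * u for x, u in zip(x_row, u_row)]
             for x_row, u_row in zip(x_c, u_c)]
            for x_c, u_c, s_c in zip(X, U, s)]


def double_everything(X):
    """Hypothetical stand-in for the conv layers: just doubles every value."""
    return [[[2.0 * a for a in row] for row in x_c] for x_c in X]


# C = 2 channels of size 1 x 2; all-zero FC weights give a gate of exactly 0.5.
X = [[[1.0, 2.0]], [[3.0, 4.0]]]
W1 = [[0.0, 0.0]]    # (C/r) x C with r = 2
W2 = [[0.0], [0.0]]  # C x (C/r)
Y = se_residual_block(X, double_everything, W1, W2)
print(Y)  # [[[2.0, 4.0]], [[6.0, 8.0]]]
```

The identity path is untouched by the gate, so even a gate near zero leaves the skip connection intact.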

4.4 Model and Computational Complexity

trade-off between model complexity and performance

Setting	ResNet-50	SE-ResNet-50
GPU: training, mini-batch of 256 images, 8 Titan X	190 ms	209 ms
CPU: inference	164 ms	167 ms

global pooling and inner product are less optimised in existing GPU libraries

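The extra parameters come from the two FC layers of the gating mechanism:

$\frac{2}{r}\sum_{s=1}^{S} N_s \cdot C_s^2$

$s$ indexes the stage ( $S$ is the number of stages)
$r$ is the reduction ratio
$N_s$ is the number of repeated blocks in stage $s$
$C_s$ is the dimension of the output channels, i.e. the number of channels, in stage $s$

The figure makes the calculation clear: one block in a stage adds $\frac{C}{r} \cdot C + \frac{C}{r} \cdot C = \frac{2C^2}{r}$ weights (biases ignored). Since $C$ grows in later stages, the later the stage, the larger the overhead. The authors' experiments show that removing the SE blocks from the last stage costs little accuracy while making the overhead noticeably smaller.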
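The parameter formula is easy to evaluate by hand or in code. The sketch below assumes the usual ResNet-50 layout (4 stages with 3, 4, 6 and 3 bottleneck blocks and 256, 512, 1024 and 2048 output channels); that layout is my assumption about the backbone, it is not stated in this post, and the function name `se_extra_params` is made up for the example.

```python
def se_extra_params(blocks_per_stage, channels_per_stage, r):
    """Extra FC weights added by SE blocks: (2 / r) * sum_s N_s * C_s^2, biases ignored."""
    return sum(2 * n * c * c // r
               for n, c in zip(blocks_per_stage, channels_per_stage))


# Assumed ResNet-50 stage layout (see the lead-in).
N = [3, 4, 6, 3]             # N_s: blocks per stage
C = [256, 512, 1024, 2048]   # C_s: output channels per stage

total = se_extra_params(N, C, r=16)
without_last = se_extra_params(N[:-1], C[:-1], r=16)

print(total)         # 2514944
print(without_last)  # 942080
```

Under these assumptions the last stage alone accounts for 1,572,864 of the 2,514,944 extra weights, which is why dropping its SE blocks shrinks the overhead so much.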

5 Experiments

Datasets

ImageNet 2012
COCO
Places 365-Challenge

$r$: the reduction ratio is set to 16

5.1 ImageNet Classification

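Look at the numbers in parentheses in the SENet columns: adding SE blocks improves every architecture. Next, the training and test curves.

Performance on lightweight networks:

The numbers in parentheses in the SENet columns of Table 2 and Table 3 say it all: SE blocks are strong and general, and can be used in combination with a wide range of architectures (residual or non-residual).

Head-to-head comparison with other top-performing models: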

5.2 Scene Classification

providing evidence that SE blocks can perform well on different datasets

5.3 Object Detection on COCO

Built on Faster R-CNN, and the gains are substantial.

5.4 Analysis and Interpretation

1) Reduction ratio

$r$: the authors set the reduction ratio to 16 as a trade-off between model complexity and performance.

2) The role of Excitation

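As I read it, this examines the activations produced by the excitation step of the SE blocks (the gate values that scale the channels): 5 classes, 50 samples per class, average activations for fifty uniformly sampled channels.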

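The authors make three observations:

Lower-layer features are typically more general (for example (a), which indicates that features are shared across classes).
Higher-layer features have greater specificity (for example (c) and (d), where different classes take different values on different channels).
In (e) the activations are close to 1, so the block acts almost like an identity mapping. Adding an SE block there matters little, and leaving it out saves a substantial share of the overhead; see the analysis in section 4.4 of this post.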
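The identity remark in the last observation can be checked directly: if every gate value is 1, the scale step returns its input unchanged. A tiny sketch with made-up values (the function name `f_scale` follows the notation of this post):

```python
def f_scale(U, s):
    """Multiply every value of channel c by its gate value s_c."""
    return [[[s_c * a for a in row] for row in u_c] for u_c, s_c in zip(U, s)]


# Toy feature map with C = 2 channels of size 2 x 2.
U = [
    [[1.0, -2.0], [3.0, 0.5]],
    [[4.0, 0.0], [-1.0, 2.0]],
]

saturated = f_scale(U, [1.0, 1.0])  # gate saturated at 1: nothing changes
selective = f_scale(U, [1.0, 0.5])  # a selective gate halves channel 2

print(saturated == U)  # True
print(selective[1])    # [[2.0, 0.0], [-0.5, 1.0]]
```

A block whose gate has saturated this way contributes nothing beyond its cost, which is the argument for removing it.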

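Summary

SENet builds on a gating mechanism and is general-purpose. The feature analysis in Figure 5 is especially important, as is the trade-off between complexity and performance controlled by the reduction ratio (in the two fully connected layers).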