SCA-CNN: Spatial and Channel-wise Attention in Convolutional Networks for Image Captioning

Source: CVPR 2017
Keywords: image captioning; attention mechanism

Motivation

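Visual attention has already proven successful in image/video captioning and visual question answering. The rationale is that people look selectively at the parts of an image they need.
The paper divides attention-based models into three categories: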

Spatial Attention: weighted pooling discards spatial information, and attention is applied only in the last conv-layer.
Semantic Attention: these models require external resources to train the semantic attributes.
Multi-layer Attention: to overcome the weakness of the large receptive field size in last-conv-layer attention, Seo et al. [22] proposed a multi-layer attention network.

However, current attention-based image captioning models consider only spatial characteristics; for example, they re-weight the last conv-layer feature map of the CNN that encodes the input image. The paper therefore proposes SCA-CNN, which exploits every feature entry of the multi-layer 3D feature maps.
As the paper's overview figure shows, first, different filters extract features from different perspectives, so channel-wise attention can select semantically relevant features as needed.

For example, when we want to predict cake, our channel-wise attention (e.g., in the conv5_3/conv5_4 feature map) will assign more weight to channel-wise feature maps generated by filters that respond to semantics like cake, fire, light, and candle-like shapes.

Second, a feature map depends on the feature maps of lower layers, so applying attention across layers is a natural step.

For example, it is beneficial to emphasize lower-layer channels corresponding to more elemental shapes like array and cylinder that compose cake.

Spatial and Channel-wise Attention CNN

In one sentence, the approach is this: on top of an encoder-decoder model, it makes the original CNN multi-layer feature maps adaptive to the sentence context through channel-wise attention and spatial attention at multiple layers.
At layer $l$, the spatial and channel-wise attention weights $\gamma^{l}$ are computed by a function of $h_{t-1}$ and $V^{l}$, where $h_{t-1} \in R^{d}$ is the previous LSTM hidden state and $V^{l}$ is the CNN feature map of the current layer. The attention mechanism is expressed by the following equations:

$$ V^{l} = \mathrm{CNN}(X^{l-1}) $$

$$ \gamma^{l} = \Phi(h_{t-1}, V^{l}) $$

$$ X^{l} = f(V^{l}, \gamma^{l}) $$

where $X^{l}$ is the modulated feature and $f(\cdot)$ is a linear weighting function that modulates CNN features and attention weights.
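The recursion above (compute features, compute attention from the previous hidden state, modulate, feed to the next layer) can be sketched in NumPy. This is only an illustration of the control flow: `conv_layer`, `phi`, and `f` are hypothetical stand-ins (a 1x1 convolution, a toy sigmoid gate, and element-wise weighting), not the paper's actual layers or attention functions.

```python
import numpy as np

def conv_layer(x, weight):
    """Hypothetical stand-in for one CNN layer: 1x1 convolution + ReLU."""
    # x: (H, W, C_in), weight: (C_in, C_out) -> (H, W, C_out)
    return np.maximum(x @ weight, 0.0)

def phi(h_prev, v, w_h):
    """Toy attention function (assumption): one weight in (0, 1) per feature entry."""
    # Project the hidden state to one offset per channel, then apply a sigmoid.
    logits = v + h_prev @ w_h
    return 1.0 / (1.0 + np.exp(-logits))

def f(v, gamma):
    """Linear weighting: scale every feature entry by its attention weight."""
    return v * gamma

def attentive_forward(x0, h_prev, layers):
    """Apply V = CNN(X), gamma = Phi(h, V), X = f(V, gamma) layer by layer."""
    x = x0
    for weight, w_h in layers:
        v = conv_layer(x, weight)      # V^l
        gamma = phi(h_prev, v, w_h)    # gamma^l, same shape as V^l
        x = f(v, gamma)                # X^l, input to the next layer
    return x

rng = np.random.default_rng(0)
x0 = rng.normal(size=(4, 4, 3))        # toy input feature map
h_prev = rng.normal(size=6)            # toy LSTM hidden state, d = 6
layers = [
    (rng.normal(size=(3, 8)), rng.normal(size=(6, 8))),
    (rng.normal(size=(8, 5)), rng.normal(size=(6, 5))),
]
x_out = attentive_forward(x0, h_prev, layers)
print(x_out.shape)  # (4, 4, 5)
```

Note that the same hidden state `h_prev` conditions every layer, which is what makes the feature maps depend on the sentence context.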
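To save memory, the authors propose an approximation in which the spatial attention weights $\alpha^{l}$ and the channel-wise attention weights $\beta^{l}$ are computed separately:

$$ \alpha^{l} = \Phi_{s}(h_{t-1}, V^{l}) $$

$$ \beta^{l} = \Phi_{c}(h_{t-1}, V^{l}) $$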
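To see why the factorization saves memory, compare the number of attention weights per layer: a full $\gamma^{l}$ needs one weight per feature entry ($W \times H \times C$), while separate spatial and channel-wise weights need only $W \times H + C$. The 14 x 14 x 512 feature map size below is an assumed example, not a figure taken from these notes.

```python
def attention_weight_counts(width, height, channels):
    """Return (full, factored) attention weight counts for one feature map."""
    full = width * height * channels      # one weight per feature entry
    factored = width * height + channels  # spatial weights + channel weights
    return full, factored

full, factored = attention_weight_counts(14, 14, 512)
print(full, factored)  # 100352 708
```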

Spatial Attention

Without loss of generality, we drop the layer superscript $l$ and flatten $V$ into $m = W \cdot H$ vectors of dimension $C$, $V = [v_{1}, v_{2}, \ldots, v_{m}]$. $\Phi_{s}$ can then be computed as follows:
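$$ a = \tanh\big((W_{s} V + b_{s}) \oplus W_{hs} h_{t-1}\big) $$

$$ \alpha = \mathrm{softmax}(W_{i} a + b_{i}) $$

where $W_{s} \in R^{k \times C}$, $W_{hs} \in R^{k \times d}$, and $W_{i} \in R^{k}$ are transformation matrices, $b_{s} \in R^{k}$ and $b_{i} \in R^{1}$ are biases, and $\oplus$ adds a vector to every column of a matrix.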
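A minimal NumPy sketch of spatial attention, assuming the form I understand the paper to use: project the $m$ location vectors and $h_{t-1}$ into a shared $k$-dimensional space, add them, apply tanh, score each location with a vector $W_{i}$, and take a softmax over the $m$ locations. The parameter names follow that notation; the toy sizes and the random initialization are assumptions for illustration only.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def spatial_attention(V, h_prev, W_s, b_s, W_hs, W_i, b_i):
    """Return one attention weight per spatial location.

    V: (C, m) feature map flattened to m = W*H locations
    h_prev: (d,) previous LSTM hidden state
    """
    # Add the projected hidden state to every column (location) of W_s V + b_s.
    a = np.tanh(W_s @ V + b_s[:, None] + (W_hs @ h_prev)[:, None])  # (k, m)
    return softmax(W_i @ a + b_i)                                   # (m,)

rng = np.random.default_rng(0)
C, m, d, k = 8, 9, 6, 5                 # toy sizes (assumed)
V = rng.normal(size=(C, m))
h_prev = rng.normal(size=d)
alpha = spatial_attention(
    V, h_prev,
    W_s=rng.normal(size=(k, C)), b_s=rng.normal(size=k),
    W_hs=rng.normal(size=(k, d)),
    W_i=rng.normal(size=k), b_i=0.0,
)
X = V * alpha[None, :]                  # re-weight every location across all channels
print(alpha.shape, X.shape)  # (9,) (8, 9)
```

Because the softmax runs over locations, all channels at a given position share the same weight.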

Channel-wise Attention

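The filters in a CNN can be viewed as different pattern detectors, so applying channel-wise attention to the feature maps amounts to selecting semantically relevant features.
First, $V$ is reshaped into $U = [u_{1}, u_{2}, \ldots, u_{C}]$, where $u_{i} \in R^{W \times H}$ is the feature map of the $i$-th channel; mean pooling over each feature map then yields the channel feature $v$:

$$ v = [v_{1}, v_{2}, \ldots, v_{C}], \quad v \in R^{C} $$

The channel-wise attention model $\Phi_{c}$ is computed as follows: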

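$$ b = \tanh\big((W_{c} \otimes v + b_{c}) \oplus W_{hc} h_{t-1}\big) $$

$$ \beta = \mathrm{softmax}(W'_{i} b + b'_{i}) $$

where $W_{c} \in R^{k}$, $W_{hc} \in R^{k \times d}$, and $W'_{i} \in R^{k}$ are transformation parameters, $b_{c} \in R^{k}$ and $b'_{i} \in R^{1}$ are biases, and $\otimes$ denotes the outer product of vectors.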
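A matching NumPy sketch of channel-wise attention, assuming the form I understand the paper to use: mean-pool each channel to a scalar, lift the resulting vector $v$ to $k$ dimensions with an outer product, add the projected hidden state, apply tanh, and take a softmax over the $C$ channels. The toy sizes and the random initialization are again assumptions.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def channel_attention(V, h_prev, W_c, b_c, W_hc, W_i, b_i):
    """Return one attention weight per channel.

    V: (C, m) feature map with m = W*H locations per channel
    h_prev: (d,) previous LSTM hidden state
    """
    v = V.mean(axis=1)                                   # (C,) mean pooling per channel
    # Outer product lifts each channel scalar to a k-dimensional column.
    b = np.tanh(np.outer(W_c, v) + b_c[:, None] + (W_hc @ h_prev)[:, None])  # (k, C)
    return softmax(W_i @ b + b_i)                        # (C,)

rng = np.random.default_rng(1)
C, m, d, k = 8, 9, 6, 5                 # toy sizes (assumed)
V = rng.normal(size=(C, m))
h_prev = rng.normal(size=d)
beta = channel_attention(
    V, h_prev,
    W_c=rng.normal(size=k), b_c=rng.normal(size=k),
    W_hc=rng.normal(size=(k, d)),
    W_i=rng.normal(size=k), b_i=0.0,
)
X = V * beta[:, None]                   # re-weight every channel across all locations
print(beta.shape, X.shape)  # (8,) (8, 9)
```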

Depending on the order in which channel-wise attention and spatial attention are applied, the model comes in two variants: Channel-Spatial (C-S) and Spatial-Channel (S-C).
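The two orderings can be sketched as follows. The assumption here is that the second attention is computed on features already modulated by the first, and that the output applies both sets of weights to $V$; the helper names `phi_s`, `phi_c`, and the parameter dictionary `p` are hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def phi_s(V, h, p):
    """Spatial attention: one weight per location, V is (C, m)."""
    a = np.tanh(p["W_s"] @ V + p["b_s"][:, None] + (p["W_hs"] @ h)[:, None])
    return softmax(p["W_i"] @ a + p["b_i"])

def phi_c(V, h, p):
    """Channel-wise attention: one weight per channel, V is (C, m)."""
    v = V.mean(axis=1)
    b = np.tanh(np.outer(p["W_c"], v) + p["b_c"][:, None] + (p["W_hc"] @ h)[:, None])
    return softmax(p["W_i2"] @ b + p["b_i2"])

def channel_spatial(V, h, p):
    """C-S: channel attention first, spatial attention on the channel-weighted map."""
    beta = phi_c(V, h, p)
    alpha = phi_s(V * beta[:, None], h, p)
    return V * beta[:, None] * alpha[None, :]

def spatial_channel(V, h, p):
    """S-C: spatial attention first, channel attention on the spatially weighted map."""
    alpha = phi_s(V, h, p)
    beta = phi_c(V * alpha[None, :], h, p)
    return V * alpha[None, :] * beta[:, None]

rng = np.random.default_rng(2)
C, m, d, k = 8, 9, 6, 5                 # toy sizes (assumed)
p = {
    "W_s": rng.normal(size=(k, C)), "b_s": rng.normal(size=k),
    "W_hs": rng.normal(size=(k, d)), "W_i": rng.normal(size=k), "b_i": 0.0,
    "W_c": rng.normal(size=k), "b_c": rng.normal(size=k),
    "W_hc": rng.normal(size=(k, d)), "W_i2": rng.normal(size=k), "b_i2": 0.0,
}
V = rng.normal(size=(C, m))
h = rng.normal(size=d)
X_cs = channel_spatial(V, h, p)
X_sc = spatial_channel(V, h, p)
print(X_cs.shape, X_sc.shape)  # (8, 9) (8, 9)
```

The two variants share parameters and output shape; they differ only in which feature map each attention function sees.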

Result

Omitted.

Conclusions

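Strengths: attention is introduced at three levels (spatial, channel-wise, and multi-layer) rather than only at the spatial level; the related work is summarized well; the experiments are fairly thorough.
Weakness: adding attention at more layers makes the model prone to overfitting.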