Source: CVPR 2017
Keywords: image captioning; attention mechanism
Original paper

Motivation

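Visual attention has already proven successful in image/video captioning and visual question answering. The rationale is that people selectively look at only part of an image, depending on what they need.
The paper groups attention-based models into three categories: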

  • Spatial Attention: weighted pooling loses spatial information; attention is only applied in the last conv-layer.
  • Semantic Attention: these models require external resources to train these semantic attributes.
  • Multi-layer Attention: To overcome the weakness of the large receptive field size in last-conv-layer attention, Seo et al. [22] proposed a multi-layer attention network.

However, current attention-based image captioning models consider only the spatial characteristic, e.g., they re-weight the last conv-layer feature map of a CNN that encodes the input image. The paper therefore proposes SCA-CNN, which exploits every feature entry of the multi-layer 3D feature maps.
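[Figure: overview of SCA-CNN: Spatial and Channel-wise Attention in Convolutional Networks for Image Captioning]
As the figure illustrates, first, different filters extract features from different perspectives, so channel-wise attention can select semantically relevant features as needed.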

For example, when we want to predict cake, our channel-wise attention(e.g., in the conv5_3/conv5_4 feature map) will assign more weights on channel-wise feature maps generated by filters according to the semantics like cake, fire, light, and candle-like shapes.

Second, a feature map depends on the feature maps of the lower layers, so applying attention across multiple layers is natural.

For example, it is beneficial to emphasize on lower-layer channels corresponding to more elemental shapes like array and cylinder that compose cake

Spatial and Channel-wise Attention CNN

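In one sentence, the approach builds on the encoder-decoder model and makes the original CNN multi-layer feature maps adaptive to the sentence context through channel-wise attention and spatial attention at multiple layers.
At layer $l$, the spatial and channel-wise attention weights $\gamma^l$ are computed as a function of $h_{t-1}$ and $V^l$, where $h_{t-1} \in \mathbb{R}^d$ is the previous LSTM hidden state and $V^l$ is the CNN feature map of the current layer. The attention mechanism is expressed as:

$$V^l = \mathrm{CNN}(X^{l-1}),$$
$$\gamma^l = \Phi(h_{t-1}, V^l),$$
$$X^l = f(V^l, \gamma^l),$$
where $X^l$ is the modulated feature and $f(\cdot)$ is a linear weighting function that modulates the CNN features with the attention weights.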
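To save memory, the authors propose an approximation in which the spatial attention weights $\alpha^l$ and the channel-wise attention weights $\beta^l$ are computed separately:
$$\alpha^l = \Phi_s(h_{t-1}, V^l),$$
$$\beta^l = \Phi_c(h_{t-1}, V^l).$$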
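To get a feel for why the factorization helps, the short sketch below counts only the attention weights themselves for one layer, assuming a joint map has one weight per feature entry (W·H·C) while the factorized form needs W·H spatial weights plus C channel weights. The function name and the 14×14×512 layer size are illustrative assumptions, not values from the post, and intermediate activations are ignored.

```python
def attention_weight_counts(W, H, C):
    """Return (joint, factorized) numbers of attention weights for one layer."""
    joint = W * H * C        # one weight per feature entry of the 3D map
    factorized = W * H + C   # spatial weights (W*H) plus channel-wise weights (C)
    return joint, factorized

# Hypothetical layer size: 14 x 14 spatial grid with 512 channels.
joint, factorized = attention_weight_counts(14, 14, 512)
print(joint, factorized)  # 100352 708
```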

Spatial Attention

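Without loss of generality, we drop the layer index $l$ and reshape $V$ by flattening its width and height into $m = W \cdot H$ vectors of dimension $C$: $V = [v_1, v_2, \ldots, v_m]$, $v_i \in \mathbb{R}^C$. In the paper, $\Phi_s$ is a single-layer network followed by a softmax:
$$a = \tanh\big((W_s V + b_s) \oplus W_{hs} h_{t-1}\big),$$
$$\alpha = \mathrm{softmax}(W_i a + b_i),$$
where $W_s \in \mathbb{R}^{k \times C}$, $W_{hs} \in \mathbb{R}^{k \times d}$ and $W_i \in \mathbb{R}^{k}$ are transformation matrices, $b_s$ and $b_i$ are biases, and $\oplus$ adds a vector to every column of a matrix.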
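The following NumPy sketch illustrates one way such a spatial attention could be computed: a single-layer additive scorer over the m locations, followed by a softmax. The parameter names (`W_s`, `W_hs`, `W_i`, `b_s`, `b_i`), the tiny dimensions, and the random values are assumptions for illustration only; this is not the authors' code.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def spatial_attention(V, h_prev, W_s, b_s, W_hs, W_i, b_i):
    """V: (C, m) features at m = W*H locations; h_prev: (d,). Returns alpha: (m,)."""
    # Project every location to k dims, then add the projected hidden state
    # to each column (matrix-plus-vector addition via broadcasting).
    a = np.tanh((W_s @ V + b_s[:, None]) + (W_hs @ h_prev)[:, None])  # (k, m)
    scores = W_i @ a + b_i                                            # (m,)
    return softmax(scores)

# Toy dimensions and random parameters (placeholders, not trained values).
rng = np.random.default_rng(0)
C, m, d, k = 8, 6, 5, 4
V = rng.standard_normal((C, m))
h_prev = rng.standard_normal(d)
alpha = spatial_attention(
    V, h_prev,
    W_s=rng.standard_normal((k, C)), b_s=rng.standard_normal(k),
    W_hs=rng.standard_normal((k, d)),
    W_i=rng.standard_normal(k), b_i=0.0,
)
X = V * alpha  # scale the C-dim feature at each location by its weight
```

The weights in `alpha` are positive and sum to one, so `X` keeps the shape of `V` while emphasizing the locations the scorer considers relevant to the current hidden state.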

Channel-wise Attention

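Each filter in a CNN can be viewed as a pattern detector, so applying channel-wise attention to the filter maps amounts to selecting semantically relevant features.
First, reshape $V$ to $U = [u_1, u_2, \ldots, u_C]$, where $u_i \in \mathbb{R}^{W \times H}$ is the $i$-th channel, then apply mean pooling to each feature map to obtain $v$:

$$v = [v_1, v_2, \ldots, v_C], \quad v \in \mathbb{R}^C.$$
In the paper, the channel-wise attention model $\Phi_c$ is computed as:
$$b = \tanh\big((W_c \otimes v + b_c) \oplus W_{hc} h_{t-1}\big),$$
$$\beta = \mathrm{softmax}(W'_i b + b'_i),$$
where $W_c \in \mathbb{R}^{k}$, $W_{hc} \in \mathbb{R}^{k \times d}$ and $W'_i \in \mathbb{R}^{k}$ are transformation matrices, $\otimes$ is the outer product of vectors, and $b_c$, $b'_i$ are biases.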
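A minimal NumPy sketch of channel-wise attention along these lines: mean-pool each channel to a scalar, score the C channels with a single-layer additive network conditioned on the hidden state, and normalize with a softmax. Parameter names, dimensions, and random values are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def channel_attention(V, h_prev, W_c, b_c, W_hc, W_i, b_i):
    """V: (C, W, H) feature map; h_prev: (d,). Returns beta: (C,)."""
    C = V.shape[0]
    v = V.reshape(C, -1).mean(axis=1)             # mean pooling per channel, (C,)
    proj = np.outer(W_c, v) + b_c[:, None]        # lift each scalar to k dims, (k, C)
    b = np.tanh(proj + (W_hc @ h_prev)[:, None])  # add hidden state to each column
    return softmax(W_i @ b + b_i)                 # one weight per channel, (C,)

# Toy dimensions and random parameters (placeholders, not trained values).
rng = np.random.default_rng(0)
C, W, H, d, k = 8, 3, 2, 5, 4
V = rng.standard_normal((C, W, H))
h_prev = rng.standard_normal(d)
beta = channel_attention(
    V, h_prev,
    W_c=rng.standard_normal(k), b_c=rng.standard_normal(k),
    W_hc=rng.standard_normal((k, d)),
    W_i=rng.standard_normal(k), b_i=0.0,
)
X = V * beta[:, None, None]  # scale every channel by its attention weight
```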

Depending on the order in which channel-wise attention and spatial attention are applied, the model comes in two variants: Channel-Spatial (C-S) and Spatial-Channel (S-C).
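The difference between the two orderings can be sketched as follows, assuming the second attention is computed on the feature map already re-weighted by the first. `phi_c` and `phi_s` are passed in as callables; the two toy scorers used in the demo are hypothetical stand-ins with no learned parameters, chosen only so that the example runs.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def channel_spatial(V, h_prev, phi_c, phi_s):
    """C-S: channel-wise attention first, then spatial attention. V: (C, m)."""
    beta = phi_c(h_prev, V)        # (C,)
    V_c = V * beta[:, None]        # re-weight channels
    alpha = phi_s(h_prev, V_c)     # (m,), computed on the channel-weighted map
    return V_c * alpha[None, :]

def spatial_channel(V, h_prev, phi_c, phi_s):
    """S-C: spatial attention first, then channel-wise attention. V: (C, m)."""
    alpha = phi_s(h_prev, V)       # (m,)
    V_s = V * alpha[None, :]       # re-weight locations
    beta = phi_c(h_prev, V_s)      # (C,), computed on the spatially weighted map
    return V_s * beta[:, None]

# Hypothetical parameter-free scorers, for demonstration only.
def toy_phi_c(h, V):
    return softmax(V.mean(axis=1) * h.sum())

def toy_phi_s(h, V):
    return softmax(V.sum(axis=0) * h.mean())

rng = np.random.default_rng(0)
V = rng.standard_normal((8, 6))    # C = 8 channels, m = 6 locations
h_prev = rng.standard_normal(5)
X_cs = channel_spatial(V, h_prev, toy_phi_c, toy_phi_s)
X_sc = spatial_channel(V, h_prev, toy_phi_c, toy_phi_s)
```

Both variants return a feature map of the same shape as `V`; they differ only in which attention sees the raw features and which sees the already modulated ones.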

Result

Conclusions

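Strengths: attention is introduced at three levels (spatial, channel-wise, and multi-layer) rather than only at the spatial level; the related work is summarized well; the experiments are fairly thorough.
Weakness: adding more attention layers makes the model prone to overfitting.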
