I. Main innovations of the paper

1. Turning multiplications into convolutions

The paper replaces the conventional ALSTM (attention LSTM) with a convolutional ALSTM: every matrix multiplication in both the LSTM and the soft-attention model becomes a convolution. The input to the LSTM is then no longer a vector but a 2-D feature map, which preserves the spatial structure of the feature map.
(figure from the paper "VideoLSTM Convolves, Attends and Flows for Action Recognition", omitted)
In the update equations for each gate inside the LSTM, all multiplications become convolutions.
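For reference, the standard ConvLSTM gate updates take the form below, where $\ast$ denotes convolution and $\odot$ element-wise multiplication; in the ALSTM variant the input map is additionally weighted by the attention map before entering the gates. This is the generic formulation and may differ in notation details from the paper's figure:

```latex
\begin{aligned}
i_t &= \sigma(W_{xi} \ast X_t + W_{hi} \ast H_{t-1} + b_i) \\
f_t &= \sigma(W_{xf} \ast X_t + W_{hf} \ast H_{t-1} + b_f) \\
o_t &= \sigma(W_{xo} \ast X_t + W_{ho} \ast H_{t-1} + b_o) \\
\tilde{C}_t &= \tanh(W_{xc} \ast X_t + W_{hc} \ast H_{t-1} + b_c) \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \\
H_t &= o_t \odot \tanh(C_t)
\end{aligned}
```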
The attention model also becomes convolutional.
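The idea above can be sketched in NumPy. This is a minimal single-channel illustration of "multiplications become convolutions" (2-D state maps, convolutional gates, convolutional soft attention), not the paper's exact implementation; all weight names and kernel sizes here are illustrative.

```python
# Minimal single-channel convolutional ALSTM step (illustrative sketch).
import numpy as np

def conv2d(x, k):
    """'Same'-padded 2-D convolution, single channel."""
    kh, kw = k.shape
    xp = np.pad(x, ((kh // 2,) * 2, (kw // 2,) * 2))
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * k)
    return out

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def soft_attention(x, h_prev, w_xa, w_ha, w_z):
    """Convolutional soft attention: score map, then softmax over locations."""
    z = conv2d(np.tanh(conv2d(x, w_xa) + conv2d(h_prev, w_ha)), w_z)
    e = np.exp(z - z.max())
    return e / e.sum()                     # attention map, sums to 1

def conv_alstm_step(x, h_prev, c_prev, w):
    """One time step: attend over the input map, then ConvLSTM gate updates."""
    a = soft_attention(x, h_prev, w['xa'], w['ha'], w['z'])
    xt = a * x                             # attention-weighted input map
    i = sigmoid(conv2d(xt, w['xi']) + conv2d(h_prev, w['hi']))
    f = sigmoid(conv2d(xt, w['xf']) + conv2d(h_prev, w['hf']))
    o = sigmoid(conv2d(xt, w['xo']) + conv2d(h_prev, w['ho']))
    g = np.tanh(conv2d(xt, w['xc']) + conv2d(h_prev, w['hc']))
    c = f * c_prev + i * g                 # cell state stays a 2-D map
    h = o * np.tanh(c)                     # hidden state stays a 2-D map
    return h, c, a

rng = np.random.default_rng(0)
w = {k: rng.normal(scale=0.1, size=(3, 3))
     for k in ['xa', 'ha', 'z', 'xi', 'hi', 'xf', 'hf', 'xo', 'ho', 'xc', 'hc']}
x = rng.normal(size=(7, 7))                # a 7x7 input feature map
h = c = np.zeros((7, 7))
h, c, a = conv_alstm_step(x, h, c, w)
print(h.shape, round(float(a.sum()), 6))   # hidden state keeps spatial shape
```

Note how the hidden and cell states remain 7x7 maps throughout, which is exactly what preserves the spatial relations that a flattened-vector LSTM would destroy.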

2. Changing the attention model

Previous attention models computed attention from the current input and the hidden state of the previous time step. By adding a bottom layer (which is itself an LSTM), the attention at time t instead depends on the hidden state Ht of frame t, rather than on the previous frame's Ht-1.
The bottom layer is the LSTM represented by the bottom row of circles in the paper's architecture figure; its role is to generate the motion-based attention.
The parameter updates (given as a figure in the original post) use the following symbols:
Ht−1: the previous hidden state from the top layer
Mt: the feature map extracted from the optical-flow image at time step t
According to the paper, replacing the linear transform plus tanh in the attention model with an LSTM cell is what couples in the current frame's hidden state (honestly, a rather hand-wavy explanation).
As the paper puts it: "with the updated LSTM cell the attention at frame t depends on the hidden state from the same frame t, instead of the previous frame t − 1."
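The data flow described above can be sketched as follows: a bottom ConvLSTM consumes the flow feature map Mt, and the attention for frame t is derived from that layer's hidden state at the same time step t. This is an illustrative NumPy sketch under assumed kernel sizes and weight names, not the paper's exact formulation.

```python
# Sketch of motion-based attention: a bottom ConvLSTM runs on the optical-flow
# feature map M_t, and the attention for frame t comes from that layer's
# hidden state at the SAME step t (not from H_{t-1}).
import numpy as np

def conv2d(x, k):
    """'Same'-padded single-channel 2-D convolution."""
    kh, kw = k.shape
    xp = np.pad(x, ((kh // 2,) * 2, (kw // 2,) * 2))
    return np.array([[np.sum(xp[i:i + kh, j:j + kw] * k)
                      for j in range(x.shape[1])] for i in range(x.shape[0])])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv_lstm_step(m, h_prev, c_prev, w):
    """Plain ConvLSTM update on the motion map (the 'bottom layer')."""
    i = sigmoid(conv2d(m, w['xi']) + conv2d(h_prev, w['hi']))
    f = sigmoid(conv2d(m, w['xf']) + conv2d(h_prev, w['hf']))
    o = sigmoid(conv2d(m, w['xo']) + conv2d(h_prev, w['ho']))
    g = np.tanh(conv2d(m, w['xc']) + conv2d(h_prev, w['hc']))
    c = f * c_prev + i * g
    return o * np.tanh(c), c

def motion_attention(h_t, w_z):
    """Attention map computed from the bottom layer's CURRENT hidden state."""
    z = conv2d(h_t, w_z)
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(1)
w = {k: rng.normal(scale=0.1, size=(3, 3))
     for k in ['xi', 'hi', 'xf', 'hf', 'xo', 'ho', 'xc', 'hc']}
w_z = rng.normal(scale=0.1, size=(1, 1))
m_t = rng.normal(size=(7, 7))              # flow feature map at time t
h = c = np.zeros((7, 7))
h, c = conv_lstm_step(m_t, h, c, w)        # bottom-layer state for frame t
a_t = motion_attention(h, w_z)             # attention for the SAME frame t
print(a_t.shape, round(float(a_t.sum()), 6))
```

The key design point is the ordering: the bottom layer's update for frame t runs first, and only then is the attention computed, so the attention sees Ht rather than Ht-1.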

Related articles: