[Spatiotemporal Sequence Prediction, Part 10] Cubic LSTMs for Video Prediction

1. Address

A paper from AAAI 2019:

Cubic LSTMs for Video Prediction

Paper link: https://arxiv.org/pdf/1904.09412.pdf

2. Introduction and Model

2.1 LSTM and ConvLSTM

I only list these briefly here. I have written about them many times already, so this is just for the reader's convenience.

For the LSTM structure in detail, see the article:

The formulas are:
$\mathrm{FC}-\mathrm{LSTM}\left\{\begin{array}{l} i_{t}=\sigma\left(\mathcal{W}_{i} \cdot\left[\mathcal{X}_{t}, \mathcal{H}_{t-1}\right]+b_{i}\right) \\ f_{t}=\sigma\left(\mathcal{W}_{f} \cdot\left[\mathcal{X}_{t}, \mathcal{H}_{t-1}\right]+b_{f}\right) \\ o_{t}=\sigma\left(\mathcal{W}_{o} \cdot\left[\mathcal{X}_{t}, \mathcal{H}_{t-1}\right]+b_{o}\right) \\ c_{t}=\tanh \left(\mathcal{W}_{c} \cdot\left[\mathcal{X}_{t}, \mathcal{H}_{t-1}\right]+b_{c}\right) \\ \mathcal{C}_{t}=f_{t} \odot \mathcal{C}_{t-1}+i_{t} \odot c_{t} \\ \mathcal{H}_{t}=o_{t} \odot \tanh \left(\mathcal{C}_{t}\right) \end{array}\right.$

In abbreviated form this can be written as:
$\begin{array}{l} \mathrm{FC}-\mathrm{LSTM}: \\ \quad\left(\mathcal{C}_{t}, \mathcal{H}_{t}\right)=\mathrm{LSTM}\left(\mathcal{X}_{t},\left(\mathcal{C}_{t-1}, \mathcal{H}_{t-1}\right) ; \mathcal{W}, b ; \cdot\right) \end{array}$
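
To make the abbreviation concrete, here is a minimal NumPy sketch of one FC-LSTM step. It is only an illustration under my own assumptions: the names `fc_lstm_step`, `W`, `b` and the dictionary layout of the weights are hypothetical and not taken from the paper, and the cell state update uses the previous cell state $\mathcal{C}_{t-1}$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fc_lstm_step(x_t, state, W, b):
    """One step of (C_t, H_t) = LSTM(X_t, (C_{t-1}, H_{t-1}); W, b; .).

    W maps a gate name to a matrix of shape (hidden, input + hidden),
    b maps a gate name to a vector of shape (hidden,).
    This layout is an assumption made for the sketch.
    """
    c_prev, h_prev = state
    xh = np.concatenate([x_t, h_prev])        # [X_t, H_{t-1}]
    i = sigmoid(W["i"] @ xh + b["i"])         # input gate
    f = sigmoid(W["f"] @ xh + b["f"])         # forget gate
    o = sigmoid(W["o"] @ xh + b["o"])         # output gate
    c_tilde = np.tanh(W["c"] @ xh + b["c"])   # candidate state c_t
    c_t = f * c_prev + i * c_tilde            # forget old memory, write new
    h_t = o * np.tanh(c_t)
    return c_t, h_t

# Unroll over a short random sequence (sizes are arbitrary).
rng = np.random.default_rng(0)
n_in, n_hid = 4, 3
W = {g: rng.normal(scale=0.1, size=(n_hid, n_in + n_hid)) for g in "ifoc"}
b = {g: np.zeros(n_hid) for g in "ifoc"}
state = (np.zeros(n_hid), np.zeros(n_hid))
for x_t in rng.normal(size=(5, n_in)):
    state = fc_lstm_step(x_t, state, W, b)
c_T, h_T = state
```

The state tuple `(C, H)` is passed in and out unchanged in shape, which is exactly what the abbreviated notation expresses.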

For a detailed walkthrough of the ConvLSTM paper, see the article:

In abbreviated form, with the convolution $*$ taking the place of the matrix product $\cdot$:

$\left(\mathcal{C}_{t}, \mathcal{H}_{t}\right)=\operatorname{LSTM}\left(\mathcal{X}_{t},\left(\mathcal{C}_{t-1}, \mathcal{H}_{t-1}\right) ; \mathcal{W}, b ; *\right)$
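
Written out in full, and assuming the gates keep exactly the FC-LSTM form with only the product swapped for a convolution (the peephole terms of the original ConvLSTM paper are left out here, as the abbreviation suggests), this would read:

$\mathrm{ConvLSTM}\left\{\begin{array}{l} i_{t}=\sigma\left(\mathcal{W}_{i} *\left[\mathcal{X}_{t}, \mathcal{H}_{t-1}\right]+b_{i}\right) \\ f_{t}=\sigma\left(\mathcal{W}_{f} *\left[\mathcal{X}_{t}, \mathcal{H}_{t-1}\right]+b_{f}\right) \\ o_{t}=\sigma\left(\mathcal{W}_{o} *\left[\mathcal{X}_{t}, \mathcal{H}_{t-1}\right]+b_{o}\right) \\ c_{t}=\tanh \left(\mathcal{W}_{c} *\left[\mathcal{X}_{t}, \mathcal{H}_{t-1}\right]+b_{c}\right) \\ \mathcal{C}_{t}=f_{t} \odot \mathcal{C}_{t-1}+i_{t} \odot c_{t} \\ \mathcal{H}_{t}=o_{t} \odot \tanh \left(\mathcal{C}_{t}\right) \end{array}\right.$

Here $\mathcal{X}_{t}$, $\mathcal{H}_{t}$ and $\mathcal{C}_{t}$ are feature maps rather than vectors, and $[\cdot,\cdot]$ is concatenation along the channel axis.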

2.2 CubicLSTM

2.2.1 Structure

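The left panel is a 3D view, from which the structure is hard to make out, so we mainly look at panel (b) on the right, the topological diagram.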

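The main feature of the cell as a whole is that temporal and spatial information are processed separately, along the spatial axis and the temporal axis shown in the left panel.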

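CubicLSTM consists of three branches: the temporal branch, the spatial branch, and the output branch. As the names suggest, the temporal branch mainly captures motion, i.e. how objects change over time; the spatial branch mainly captures the structure of the objects themselves, i.e. spatial information; and the output branch combines the two and outputs the prediction.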

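Looking at the structure more closely, the figure is clear enough that not much needs to be added. Roughly speaking, blue is spatial and orange is temporal.
On the spatial side, what changes is mainly the layer, l -> l+1, while on the temporal side what changes is mainly the step, t -> t+1.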

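The flow is roughly as follows. The temporal hidden state, the spatial hidden state, and the input x are passed through two convolutions, one per branch, which produce the three gates and the internal candidate state for the temporal and the spatial dimension respectively. The operations that follow are basically the same as in ConvLSTM. Finally, the two resulting hidden states go through a convolution to give the final output $\mathcal{Y}_{t}$.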

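Seen from another angle, this structure is essentially a cell made of two ConvLSTMs joined together, except that the gates of each one are computed from three inputs rather than two.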

So the formulas follow readily.

2.2.2 Formulas

$\text { CubicLSTM }\left\{\begin{array}{l} \text { temporal branch }:\left(\mathcal{C}_{t, l}, \mathcal{H}_{t, l}\right)=\operatorname{LSTM}\left(\mathcal{X}_{t, l}, \mathcal{H}_{t, l-1}^{\prime},\left(\mathcal{C}_{t-1, l}, \mathcal{H}_{t-1, l}\right) ; \mathcal{W}, b ; *\right) \\ \text { spatial branch : }\left(\mathcal{C}_{t, l}^{\prime}, \mathcal{H}_{t, l}^{\prime}\right)=\operatorname{LSTM}\left(\mathcal{X}_{t, l}, \mathcal{H}_{t-1, l},\left(\mathcal{C}_{t, l-1}^{\prime}, \mathcal{H}_{t, l-1}^{\prime}\right) ; \mathcal{W}^{\prime}, b^{\prime} ; *\right) \\ \text { output branch : } \mathcal{Y}_{t, l}=\mathcal{W}^{\prime \prime} *\left[\mathcal{H}_{t, l}, \mathcal{H}_{t, l}^{\prime}\right]+b^{\prime \prime} \end{array}\right.$
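
The three branches can be sketched in NumPy as below. This is a rough illustration under my own assumptions, not the authors' implementation: the function names, the `params` dictionary, the stacking of the four gate pre-activations into one convolution, the 3x3 gate kernels, the 1x1 output kernel, and the zero spatial state fed to the single layer in the demo are all hypothetical choices made for this sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv2d_same(x, w, b):
    """Zero-padded 'same' cross-correlation.
    x: (C_in, H, W); w: (C_out, C_in, k, k) with odd k; b: (C_out,)."""
    k = w.shape[-1]
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    H, W = x.shape[1:]
    out = np.zeros((w.shape[0], H, W))
    for dy in range(k):
        for dx in range(k):
            patch = xp[:, dy:dy + H, dx:dx + W]
            out += np.einsum("oc,chw->ohw", w[:, :, dy, dx], patch)
    return out + b[:, None, None]

def lstm_branch(gate_inputs, c_prev, w, b):
    """ConvLSTM-style update whose gates see every tensor in gate_inputs.
    w has 4 * C_hid output channels: i, f, o and the candidate, stacked."""
    z = conv2d_same(np.concatenate(gate_inputs, axis=0), w, b)
    i, f, o, g = np.split(z, 4, axis=0)
    c = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g)
    h = sigmoid(o) * np.tanh(c)
    return c, h

def cubic_lstm_cell(x, temporal_state, spatial_state, params):
    """x: X_{t,l}; temporal_state: (C_{t-1,l}, H_{t-1,l});
    spatial_state: (C'_{t,l-1}, H'_{t,l-1})."""
    c_t, h_t = temporal_state
    c_s, h_s = spatial_state
    gate_inputs = [x, h_t, h_s]  # both branches read the same three tensors
    # Temporal branch: carries its own cell state along the time axis.
    c_new, h_new = lstm_branch(gate_inputs, c_t, params["W"], params["b"])
    # Spatial branch: separate weights, cell state carried along the layer axis.
    c_new_s, h_new_s = lstm_branch(gate_inputs, c_s, params["W_s"], params["b_s"])
    # Output branch: Y = W'' * [H, H'] + b''.
    y = conv2d_same(np.concatenate([h_new, h_new_s], axis=0),
                    params["W_out"], params["b_out"])
    return y, (c_new, h_new), (c_new_s, h_new_s)

# Demo: one layer unrolled over 4 time steps on random 5x5 frames.
rng = np.random.default_rng(0)
c_in, c_hid, k, height, width = 1, 2, 3, 5, 5
n_gate_in = c_in + 2 * c_hid
params = {
    "W": rng.normal(scale=0.1, size=(4 * c_hid, n_gate_in, k, k)),
    "b": np.zeros(4 * c_hid),
    "W_s": rng.normal(scale=0.1, size=(4 * c_hid, n_gate_in, k, k)),
    "b_s": np.zeros(4 * c_hid),
    "W_out": rng.normal(scale=0.1, size=(c_in, 2 * c_hid, 1, 1)),
    "b_out": np.zeros(c_in),
}
zeros = np.zeros((c_hid, height, width))
temporal_state = (zeros, zeros)
y = None
for x in rng.normal(size=(4, c_in, height, width)):
    # With a single layer there is no layer l-1, so zeros stand in (assumption).
    y, temporal_state, _ = cubic_lstm_cell(x, temporal_state, (zeros, zeros), params)
```

In a stacked model, the spatial state returned at layer l would be handed to layer l+1 at the same time step, while the temporal state stays in its layer and moves on to step t+1, matching the two axes in the figure.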