Appearance-and-Relation Networks for Video Classification
Introduction
- Proposes ARTNet, which learns video representations end-to-end
- ARTNets are built by stacking multiple SMART blocks; a SMART block models appearance and relation simultaneously from RGB input
- SMART decouples spatiotemporal learning into two parts: an appearance branch for spatial modeling and a relation branch for temporal modeling
- Appearance branch: linear combinations of pixels or filter responses within each frame
- Relation branch: multiplicative interactions between pixels or filter responses across multiple frames
- Experiments on Kinetics, UCF101, and HMDB51 show that SMART blocks outperform 3D convolutions for spatiotemporal feature learning and that ARTNet achieves state-of-the-art results
Network architecture
Video architecture comparison
Figure (a) shows a two-stream CNN and Figure (b) a 3D CNN;
Figure (c) is the ARTNet proposed in this paper, built from SMART building blocks.
As shown, a two-stream CNN takes two inputs, an RGB frame and optical flow, while a 3D CNN models appearance and relation jointly and implicitly through a single 3D convolution. ARTNet instead uses separate appearance and relation branches to model the two explicitly and simultaneously.
SMART blocks
- The 3D convolution in (a) learns spatiotemporal features jointly and implicitly
- The square-pooling structure in (b), first proposed in this paper, learns inter-frame relations independently of appearance
- (c) builds on this to form the SMART block, which learns spatiotemporal features separately and explicitly: the lower part is the appearance branch, which uses 2D convolutions to capture static structure, while the upper part is the relation branch, which uses the square-pooling structure to model temporal relations
Appearance branch
Some action classes are strongly correlated with certain object and scene classes, so static cues are also important for action recognition. This paper applies 2D convolutions to the video volume V to extract the spatial structure of each frame. The output is a feature volume F ∈ R^(Ws×Hs×Ts×Cs); F is usually followed by Batch Normalization and a ReLU non-linearity.
Relation branch
This branch operates on stacked consecutive frames to capture inter-frame relations for action recognition. These relations are crucial because they provide motion cues.
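As a concrete illustration, the two branches described above might be sketched in PyTorch roughly as follows. PyTorch itself, the channel counts, kernel sizes, and the cross-channel pooling group size are all assumptions for illustration, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class SquarePool(nn.Module):
    """Relation-branch sketch: 3D convolution -> squaring non-linearity ->
    cross-channel pooling. Squaring responses of linear filters that span
    several frames yields multiplicative interaction terms between frames.
    The group size of 2 is an illustrative assumption."""
    def __init__(self, in_ch, out_ch, group=2):
        super().__init__()
        self.group = group
        self.conv = nn.Conv3d(in_ch, out_ch * group, kernel_size=3,
                              padding=1, bias=False)
        self.bn = nn.BatchNorm3d(out_ch * group)

    def forward(self, x):                    # x: (N, C, T, H, W)
        y = self.bn(self.conv(x)) ** 2       # squaring non-linearity
        n, c, t, h, w = y.shape
        y = y.view(n, c // self.group, self.group, t, h, w)
        return y.sum(dim=2)                  # cross-channel pooling

class SMARTBlock(nn.Module):
    """Appearance branch (per-frame 2D conv) and relation branch
    (square pooling), concatenated along the channel axis."""
    def __init__(self, in_ch, app_ch, rel_ch):
        super().__init__()
        # a 1x3x3 kernel is a 2D spatial convolution applied to each frame
        self.appearance = nn.Sequential(
            nn.Conv3d(in_ch, app_ch, kernel_size=(1, 3, 3),
                      padding=(0, 1, 1), bias=False),
            nn.BatchNorm3d(app_ch),
            nn.ReLU(inplace=True))
        self.relation = SquarePool(in_ch, rel_ch)

    def forward(self, x):
        return torch.cat([self.appearance(x), self.relation(x)], dim=1)
```

With `app_ch = rel_ch = 32`, an input of shape `(N, 3, T, H, W)` produces a 64-channel output of the same temporal and spatial size.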
ARTNet
The flexibility of the SMART block allows it to replace a 3D convolution in learning spatiotemporal features.
Next, we look at how the SMART block is plugged into an existing network architecture to construct ARTNet:
The network input is a 112x112x16 volume; downsampling with a 2x2x2 stride is applied at conv3_1, conv4_1, and conv5_1.
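The stated input size and stride schedule can be sanity-checked with a toy C3D-style stack. The per-stage channel counts and the stand-in conv units are illustrative assumptions; in the actual ARTNet these units would be SMART blocks:

```python
import torch
import torch.nn as nn

def block(cin, cout, stride=1):
    # stand-in for a SMART block / 3D conv unit; a stride of 2
    # halves T, H, and W, matching the 2x2x2 downsampling strides
    return nn.Sequential(
        nn.Conv3d(cin, cout, kernel_size=3, stride=stride,
                  padding=1, bias=False),
        nn.BatchNorm3d(cout),
        nn.ReLU(inplace=True))

backbone = nn.Sequential(
    block(3, 64),               # early stage, no downsampling (simplified)
    block(64, 128, stride=2),   # conv3_1: 2x2x2 stride
    block(128, 256, stride=2),  # conv4_1: 2x2x2 stride
    block(256, 512, stride=2),  # conv5_1: 2x2x2 stride
)

x = torch.randn(1, 3, 16, 112, 112)  # 16-frame 112x112 input volume
y = backbone(x)                       # three halvings: 16->2, 112->14
```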
Experiments
Conclusion and Future Work
- ARTNet, built on SMART blocks, performs well on Kinetics, UCF101, and HMDB51
- Augmenting the RGB input with optical flow also improves ARTNet, showing that optical flow still carries complementary information; however, it has a high computational cost in real-world applications. Future work aims to improve ARTNet to close the performance gap between single-stream and two-stream inputs
- The authors also plan to extend ARTNets to deeper structures and train at larger spatial resolutions
Modeling Spatial-Temporal Clues in a Hybrid Deep Learning Framework for Video Classification
Introduction
- A hybrid deep learning framework for video classification that can "model static spatial information, short-term motion, as well as long-term temporal clues in the videos"
- Spatial and short-term motion features are extracted by separate CNNs, then combined by a regularized feature fusion network for classification
- An LSTM network is applied on top of the two features to model longer-term temporal clues
- Experiments use UCF-101 Human Actions (91.3%) and Columbia Consumer Videos (83.5%)
Hybrid deep learning framework
The figure below shows the two-stream CNN; the outputs of the first fully connected layer of the two CNNs serve as the spatial and short-term motion features for further processing.
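A rough PyTorch sketch of this pipeline, assuming 4096-d fc features, 25 sampled frames per video, and 101 classes (all illustrative dimensions), might look like:

```python
import torch
import torch.nn as nn

feat_dim, hidden, n_classes = 4096, 512, 101

# fusion network: combines concatenated spatial + motion features
# for clip-level classification (a plain MLP stands in for the
# paper's regularized fusion network)
fusion = nn.Sequential(
    nn.Linear(2 * feat_dim, hidden), nn.ReLU(),
    nn.Linear(hidden, n_classes))

# LSTM on top of the per-frame features models long-term dynamics
lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
lstm_head = nn.Linear(hidden, n_classes)

spatial = torch.randn(8, 25, feat_dim)  # 8 videos, 25 frames each
motion = torch.randn(8, 25, feat_dim)   # short-term motion features

# clip-level fusion scores, averaged over time
fused = fusion(torch.cat([spatial, motion], dim=-1)).mean(dim=1)

# long-term temporal scores from the LSTM's last hidden state
_, (h, _) = lstm(spatial)
temporal = lstm_head(h[-1])

scores = (fused + temporal) / 2  # late fusion of the two predictions
```

The final averaging is one simple way to combine the fusion-network and LSTM predictions; the paper's exact combination scheme may differ.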
Experiments
Large-scale Video Classification with Convolutional Neural Networks
Introduction
- Encouraged by the performance of CNNs on image recognition tasks, the paper presents an extensive empirical evaluation of CNNs on large-scale video classification
- It extends CNN connectivity in the time domain to exploit local spatio-temporal information, and proposes a multiresolution architecture to speed up training
Our best spatio-temporal networks display significant performance improvements compared to strong feature-based baselines (55.3% to 63.9%), but only a surprisingly modest improvement compared to single-frame models (59.3% to 60.9%). We further study the generalization performance of our best model by retraining the top layers on the UCF-101 Action Recognition dataset and observe significant performance improvements compared to the UCF-101 baseline model (63.3% up from 43.9%).
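The multiresolution idea mentioned above can be sketched as a simple preprocessing step: a low-resolution context stream sees a downsampled full frame while a fovea stream sees a center crop at the original resolution, so each CNN processes half-sized inputs. The 178x178 input and 89x89 stream sizes follow the paper; the exact crop logic here is an assumption:

```python
import torch
import torch.nn.functional as F

def streams(frame):
    """Split a frame batch (N, C, H, W) into a context stream
    (downsampled full frame) and a fovea stream (center crop at
    full resolution), each at half the input resolution."""
    n, c, h, w = frame.shape
    context = F.interpolate(frame, size=(h // 2, w // 2))  # downsample
    top, left = (h - h // 2) // 2, (w - w // 2) // 2
    fovea = frame[:, :, top:top + h // 2, left:left + w // 2]  # crop
    return context, fovea
```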
Network architecture
Conclusions
- Experiments show that performance is not very sensitive to the architectural details of the temporal connectivity; a slow fusion model outperforms the early and late fusion alternatives
- The single-frame model already performs well, suggesting that local motion cues may not be that important even for a dynamic dataset such as Sports; an alternative explanation is that more careful treatment of camera motion may be necessary (this would require significant changes to the CNN architecture and is left for future work)
- Mixed-resolution architectures, consisting of a low-resolution context stream and a high-resolution fovea stream, are an effective way to speed up CNNs without sacrificing accuracy
- Experiments on UCF-101 show that the learned features are generic and generalize to other video classification tasks
- Future work: incorporate broader categories; handle camera motion; explore RNNs for combining clip-level predictions into global video-level predictions
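The slow fusion model mentioned in the conclusions can be sketched as stacked 3D convolutions whose small temporal kernels merge frames gradually, so higher layers see progressively more of the clip. This is a two-stage simplification; kernel and stride sizes loosely follow the paper, and the channel counts are illustrative:

```python
import torch
import torch.nn as nn

# Slow fusion, sketched: the first layer spans 4 frames, the second
# merges pairs of those responses, so temporal information is fused
# gradually rather than all at once (early fusion) or only at the
# classifier (late fusion).
slow_fusion = nn.Sequential(
    nn.Conv3d(3, 32, kernel_size=(4, 11, 11), stride=(2, 4, 4)),
    nn.ReLU(),
    nn.Conv3d(32, 64, kernel_size=(2, 5, 5), stride=(2, 2, 2)),
    nn.ReLU(),
)

x = torch.randn(1, 3, 10, 170, 170)  # 10-frame clip
y = slow_fusion(x)                   # temporal extent: 10 -> 4 -> 2
```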
References
[1] Wang, L., Li, W., Li, W., & Van Gool, L. (2018). Appearance-and-relation networks for video classification. In CVPR (pp. 1430–1439).
[2] Wu, Z., Wang, X., Jiang, Y.-G., Ye, H., & Xue, X. (2015). Modeling spatial-temporal clues in a hybrid deep learning framework for video classification. In Proceedings of the 23rd ACM International Conference on Multimedia (pp. 461–470). ACM. http://doi.org/10.1145/2733373.2806222
[3] Zhou, W., Vellaikal, A., & Kuo, C. C. J. (2000). Rule-based video classification system for basketball video indexing. In Proceedings of the 2000 ACM Workshops on Multimedia (pp. 213–216). ACM. http://doi.org/10.1145/357744.357941
[4] Yue-Hei Ng, J., Hausknecht, M., Vijayanarasimhan, S., Vinyals, O., Monga, R., & Toderici, G. (2015). Beyond short snippets: Deep networks for video classification. In CVPR (pp. 4694–4702).
[5] Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. (2014). Large-scale video classification with convolutional neural networks. In CVPR.