Appearance-and-Relation Networks for Video Classification
Introduction
- Proposes ARTNet, which learns video representations end-to-end
- ARTNets are built by stacking multiple SMART blocks; a SMART block models appearance and relation simultaneously from RGB input
- SMART decouples spatiotemporal learning into two parts: an appearance branch for spatial modeling and a relation branch for temporal modeling
- Appearance branch: linear combinations of pixels or filter responses within each frame
- Relation branch: multiplicative interactions between pixels or filter responses across multiple frames
- Experiments on Kinetics, UCF101, and HMDB51 show that SMART blocks outperform 3D convolutions for spatiotemporal feature learning and that ARTNet achieves state-of-the-art results
Network architecture
Video architecture comparison
Figure (a) shows a two-stream CNN and Figure (b) a 3D CNN;
Figure (c) is the ARTNet proposed in this paper, built from SMART building blocks.
As shown, a two-stream CNN takes two inputs, an RGB frame and optical flow, while a 3D CNN models appearance and relation jointly and implicitly through a single 3D convolution. ARTNet instead uses separate appearance and relation branches to model the two explicitly and simultaneously.
SMART blocks
- The 3D convolution in (a) learns spatiotemporal features jointly and implicitly
- The square-pooling structure in (b), first proposed in this paper, learns inter-frame relations independently of appearance
- (c) builds on this to form the SMART block, which learns spatiotemporal features separately and explicitly: the lower part is the appearance branch, which uses 2D convolutions to capture static structure, while the upper part is the relation branch, which uses the square-pooling structure to model temporal relations
Appearance branch
Some action classes are strongly correlated with certain object and scene classes, so static cues are also important for action recognition. This paper applies 2D convolutions to the video volume V to extract the spatial structure of each frame. The output is a feature volume F ∈ R^(Ws×Hs×Ts×Cs); F is usually followed by Batch Normalization and a ReLU non-linearity.
Relation branch
This branch operates on stacked consecutive frames to capture inter-frame relations for action recognition. These relations are crucial because they provide motion cues.
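As a concrete illustration, the two branches described above might be sketched in PyTorch roughly as follows. PyTorch itself, the channel counts, kernel sizes, and the cross-channel pooling group size are all assumptions for illustration, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class SquarePool(nn.Module):
    """Relation-branch sketch: 3D convolution -> squaring non-linearity ->
    cross-channel pooling. Squaring responses of linear filters that span
    several frames yields multiplicative interaction terms between frames.
    The group size of 2 is an illustrative assumption."""
    def __init__(self, in_ch, out_ch, group=2):
        super().__init__()
        self.group = group
        self.conv = nn.Conv3d(in_ch, out_ch * group, kernel_size=3,
                              padding=1, bias=False)
        self.bn = nn.BatchNorm3d(out_ch * group)

    def forward(self, x):                    # x: (N, C, T, H, W)
        y = self.bn(self.conv(x)) ** 2       # squaring non-linearity
        n, c, t, h, w = y.shape
        y = y.view(n, c // self.group, self.group, t, h, w)
        return y.sum(dim=2)                  # cross-channel pooling

class SMARTBlock(nn.Module):
    """Appearance branch (per-frame 2D conv) and relation branch
    (square pooling), concatenated along the channel axis."""
    def __init__(self, in_ch, app_ch, rel_ch):
        super().__init__()
        # a 1x3x3 kernel is a 2D spatial convolution applied to each frame
        self.appearance = nn.Sequential(
            nn.Conv3d(in_ch, app_ch, kernel_size=(1, 3, 3),
                      padding=(0, 1, 1), bias=False),
            nn.BatchNorm3d(app_ch),
            nn.ReLU(inplace=True))
        self.relation = SquarePool(in_ch, rel_ch)

    def forward(self, x):
        return torch.cat([self.appearance(x), self.relation(x)], dim=1)
```

With `app_ch = rel_ch = 32`, an input of shape `(N, 3, T, H, W)` produces a 64-channel output of the same temporal and spatial size.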
ARTNet
The flexibility of the SMART block allows it to replace a 3D convolution in learning spatiotemporal features.
Next, we look at how the SMART block is plugged into an existing network architecture to construct ARTNet:
The network input is a 112x112x16 volume; downsampling with a 2x2x2 stride is applied at conv3_1, conv4_1, and conv5_1.
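The stated input size and stride schedule can be sanity-checked with a toy C3D-style stack. The per-stage channel counts and the stand-in conv units are illustrative assumptions; in the actual ARTNet these units would be SMART blocks:

```python
import torch
import torch.nn as nn

def block(cin, cout, stride=1):
    # stand-in for a SMART block / 3D conv unit; a stride of 2
    # halves T, H, and W, matching the 2x2x2 downsampling strides
    return nn.Sequential(
        nn.Conv3d(cin, cout, kernel_size=3, stride=stride,
                  padding=1, bias=False),
        nn.BatchNorm3d(cout),
        nn.ReLU(inplace=True))

backbone = nn.Sequential(
    block(3, 64),               # early stage, no downsampling (simplified)
    block(64, 128, stride=2),   # conv3_1: 2x2x2 stride
    block(128, 256, stride=2),  # conv4_1: 2x2x2 stride
    block(256, 512, stride=2),  # conv5_1: 2x2x2 stride
)

x = torch.randn(1, 3, 16, 112, 112)  # 16-frame 112x112 input volume
y = backbone(x)                       # three halvings: 16->2, 112->14
```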
Experiments
Conclusion and Future Work
- ARTNet, built on SMART blocks, performs well on Kinetics, UCF101, and HMDB51
- Augmenting the RGB input with optical flow also improves ARTNet, showing that optical flow still carries complementary information; however, it has a high computational cost in real-world applications. Future work aims to improve ARTNet to close the performance gap between single-stream and two-stream inputs
- The authors also plan to extend ARTNets to deeper structures and train at larger spatial resolutions
Modeling Spatial-Temporal Clues in a Hybrid Deep Learning Framework for Video Classification
Introduction
- A hybrid deep learning framework for video classification that can "model static spatial information, short-term motion, as well as long-term temporal clues in the videos"
- Spatial and short-term motion features are extracted by separate CNNs, then combined by a regularized feature fusion network for classification
- An LSTM network is applied on top of the two features to model longer-term temporal clues
- Experiments use UCF-101 Human Actions (91.3%) and Columbia Consumer Videos (83.5%)
Hybrid deep learning framework
The figure below shows the two-stream CNN; the outputs of the first fully connected layer of the two CNNs serve as the spatial and short-term motion features for further processing.
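A rough PyTorch sketch of this pipeline, assuming 4096-d fc features, 25 sampled frames per video, and 101 classes (all illustrative dimensions), might look like:

```python
import torch
import torch.nn as nn

feat_dim, hidden, n_classes = 4096, 512, 101

# fusion network: combines concatenated spatial + motion features
# for clip-level classification (a plain MLP stands in for the
# paper's regularized fusion network)
fusion = nn.Sequential(
    nn.Linear(2 * feat_dim, hidden), nn.ReLU(),
    nn.Linear(hidden, n_classes))

# LSTM on top of the per-frame features models long-term dynamics
lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
lstm_head = nn.Linear(hidden, n_classes)

spatial = torch.randn(8, 25, feat_dim)  # 8 videos, 25 frames each
motion = torch.randn(8, 25, feat_dim)   # short-term motion features

# clip-level fusion scores, averaged over time
fused = fusion(torch.cat([spatial, motion], dim=-1)).mean(dim=1)

# long-term temporal scores from the LSTM's last hidden state
_, (h, _) = lstm(spatial)
temporal = lstm_head(h[-1])

scores = (fused + temporal) / 2  # late fusion of the two predictions
```

The final averaging is one simple way to combine the fusion-network and LSTM predictions; the paper's exact combination scheme may differ.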
Experiments
Large-scale Video Classification with Convolutional Neural Networks
Introduction
- Encouraged by the performance of CNNs on image recognition tasks, the paper presents an extensive empirical evaluation of CNNs on large-scale video classification
- It extends CNN connectivity in the time domain to exploit local spatio-temporal information, and proposes a multiresolution architecture to speed up training
Our best spatio-temporal networks display significant performance improvements compared to strong feature-based baselines (55.3% to 63.9%), but only a surprisingly modest improvement compared to single-frame models (59.3% to 60.9%). We further study the generalization performance of our best model by retraining the top layers on the UCF-101 Action Recognition dataset and observe significant performance improvements compared to the UCF-101 baseline model (63.3% up from 43.9%).
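The multiresolution idea mentioned above can be sketched as a simple preprocessing step: a low-resolution context stream sees a downsampled full frame while a fovea stream sees a center crop at the original resolution, so each CNN processes half-sized inputs. The 178x178 input and 89x89 stream sizes follow the paper; the exact crop logic here is an assumption:

```python
import torch
import torch.nn.functional as F

def streams(frame):
    """Split a frame batch (N, C, H, W) into a context stream
    (downsampled full frame) and a fovea stream (center crop at
    full resolution), each at half the input resolution."""
    n, c, h, w = frame.shape
    context = F.interpolate(frame, size=(h // 2, w // 2))  # downsample
    top, left = (h - h // 2) // 2, (w - w // 2) // 2
    fovea = frame[:, :, top:top + h // 2, left:left + w // 2]  # crop
    return context, fovea
```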
Network architecture
Conclusions
- Experiments show that performance is not very sensitive to the architectural details of the temporal connectivity; a slow fusion model outperforms the early and late fusion alternatives
- The single-frame model already performs well, suggesting that local motion cues may not be that important even for a dynamic dataset such as Sports; an alternative explanation is that more careful treatment of camera motion may be necessary (this would require significant changes to the CNN architecture and is left for future work)
- Mixed-resolution architectures, consisting of a low-resolution context stream and a high-resolution fovea stream, are an effective way to speed up CNNs without sacrificing accuracy
- Experiments on UCF-101 show that the learned features are generic and generalize to other video classification tasks
- Future work: incorporate broader categories; handle camera motion; explore RNNs for combining clip-level predictions into global video-level predictions
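The slow fusion model mentioned in the conclusions can be sketched as stacked 3D convolutions whose small temporal kernels merge frames gradually, so higher layers see progressively more of the clip. This is a two-stage simplification; kernel and stride sizes loosely follow the paper, and the channel counts are illustrative:

```python
import torch
import torch.nn as nn

# Slow fusion, sketched: the first layer spans 4 frames, the second
# merges pairs of those responses, so temporal information is fused
# gradually rather than all at once (early fusion) or only at the
# classifier (late fusion).
slow_fusion = nn.Sequential(
    nn.Conv3d(3, 32, kernel_size=(4, 11, 11), stride=(2, 4, 4)),
    nn.ReLU(),
    nn.Conv3d(32, 64, kernel_size=(2, 5, 5), stride=(2, 2, 2)),
    nn.ReLU(),
)

x = torch.randn(1, 3, 10, 170, 170)  # 10-frame clip
y = slow_fusion(x)                   # temporal extent: 10 -> 4 -> 2
```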
References
[1] Wang, L., Li, W., Li, W., & Van Gool, L. (2018). Appearance-and-relation networks for video classification. In CVPR (pp. 1430–1439).
[2] Wu, Z., Wang, X., Jiang, Y.-G., Ye, H., & Xue, X. (2015). Modeling spatial-temporal clues in a hybrid deep learning framework for video classification. In Proceedings of the 23rd ACM International Conference on Multimedia (pp. 461–470). ACM. http://doi.org/10.1145/2733373.2806222
[3] Zhou, W., Vellaikal, A., & Kuo, C. C. J. (2000). Rule-based video classification system for basketball video indexing. In Proceedings of the 2000 ACM Workshops on Multimedia (pp. 213–216). ACM. http://doi.org/10.1145/357744.357941
[4] Yue-Hei Ng, J., Hausknecht, M., Vijayanarasimhan, S., Vinyals, O., Monga, R., & Toderici, G. (2015). Beyond short snippets: Deep networks for video classification. In CVPR (pp. 4694–4702).
[5] Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. (2014). Large-scale video classification with convolutional neural networks. In CVPR.