Action Recognition Paper Reading | Temporal Segment Networks: Towards Good Practices for Deep Action Recognition

Notes on the paper Temporal Segment Networks: Towards Good Practices for Deep Action Recognition

Temporal Segment Networks: Towards Good Practices for Deep Action Recognition, Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool, ECCV 2016, Amsterdam, Netherlands.

Temporal Segment Networks for Action Recognition in Videos, Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool, TPAMI, 2018.

Motivations

Modeling long-range temporal structure is crucial for human activity recognition.
Frames in a video are highly redundant.
- Modeling long-range temporal structure is not simply wrapping tons of frames. Frames are dense, but contents are sparse!
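- Existing CNNs can only handle short temporal sequences because the computational cost is very high. They also need a lot of training samples, but annotation is expensive, which is why most existing datasets are small.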

Solutions

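Prior approaches:
1. Convolution-based: Karpathy (fusion on Sports-1M), Simonyan (two-stream), Tran (C3D), Sun (FSTCN, which factorizes 3D convolution kernels to speed up computation). Some papers handle longer videos with a CNN+RNN structure, for example:
  
  Donahue et al. “Long-term recurrent convolutional networks for visual recognition and description”, CVPR 2015
  
  A CNN first extracts features from every frame, and an RNN then chains the per-frame feature maps. Good at long sequences.
  
  Problem: RNN computation is slow and cannot be parallelized.
  
  Because of compute limits, these methods usually process a fixed 64-120 frames at a time. With such limited temporal coverage, the model struggles to learn features that span the whole video. My guess is that the action region of the whole video should be sliced and then learned segment by segment.
2. Temporal-structure-modeling-based: Gaidon (Actom Sequence Model, ASM), Niebles (Latent SVM), Wang (Latent Hierarchical Model, LHM), Pirsiavash (Segmental Grammar Model, SGM), Wang (Sequential Skeleton Model, SSM), Fernando (BoVW).
  
  None of these are end-to-end. TSN, by contrast, is proposed as the first framework for end-to-end temporal structure modeling on entire videos.
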
Temporal Segment Networks (TSN): The Model

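The video is split evenly into K segments, and one snippet is randomly sampled from each segment. The RGB CNN outputs are aggregated together (green), the optical flow CNN outputs are aggregated together (blue), and the per-stream consensus scores then go through softmax and the two streams are fused for the final prediction.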
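
The sampling step above can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the function name `sample_snippet_indices` and the integer-boundary rule for splitting frames are my own assumptions, and it assumes the video has at least as many frames as segments.

```python
import random


def sample_snippet_indices(num_frames, num_segments, rng=None):
    """Pick one random frame index from each of `num_segments` equal-length segments.

    Hypothetical helper for illustration; segment boundaries use integer division.
    """
    if num_segments <= 0:
        raise ValueError("num_segments must be positive")
    if num_frames < num_segments:
        raise ValueError("video is shorter than the number of segments")
    rng = rng or random
    indices = []
    for k in range(num_segments):
        start = (k * num_frames) // num_segments      # first frame of segment k
        end = ((k + 1) * num_frames) // num_segments  # one past the last frame
        indices.append(rng.randrange(start, end))
    return indices


if __name__ == "__main__":
    # A 90-frame video with K = 3: one index from [0, 30), [30, 60), [60, 90).
    print(sample_snippet_indices(90, 3))
```

However long the video is, only K snippets are fed to the network per training step, so the cost does not grow with video length.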
Segment-based Sparse Sampling

The consensus function g can be averaging, weighted averaging, or maximum. The whole model stays end-to-end trainable.
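
To make the three choices of g concrete, here is a small framework-free sketch that works on plain lists of per-snippet class scores. The function names are hypothetical, and normalizing the weights to sum to one in the weighted version is my own assumption.

```python
import math


def average_consensus(snippet_scores):
    """Mean class score over snippets; snippet_scores[k][c] is the score of class c for snippet k."""
    count = len(snippet_scores)
    return [sum(column) / count for column in zip(*snippet_scores)]


def max_consensus(snippet_scores):
    """Per-class maximum over snippets."""
    return [max(column) for column in zip(*snippet_scores)]


def weighted_consensus(snippet_scores, weights):
    """Weighted mean over snippets (weights are normalized here, an assumption)."""
    total = sum(weights)
    return [
        sum(w * s for w, s in zip(weights, column)) / total
        for column in zip(*snippet_scores)
    ]


def softmax(scores):
    """Numerically stable softmax over one score vector."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]


if __name__ == "__main__":
    scores = [[1.0, 3.0], [3.0, 1.0], [2.0, 5.0]]  # 3 snippets, 2 classes
    print(softmax(average_consensus(scores)))
```

Averaging and weighted averaging are differentiable with respect to every snippet's scores, so gradients reach all K snippets. With maximum, only the winning snippet for each class receives a gradient.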

Applications

Winner of ActivityNet 2016 (93.2% mAP)

https://github.com/yjxiong/temporal-segment-networks

Experiments

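The original two-stream network used the ZFNet architecture; here the authors use Inception-V2 (BN-Inception), because deeper structures improve object recognition performance.
Input modalities for the network: warped optical flow is generated following iDT (a very strong algorithm proposed by Heng Wang that reduces background changes caused by camera motion and focuses on the changes of the moving person).
Existing action recognition datasets are small, so overfitting is easy.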
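Cross Modality Pre-training:
1. The raw RGB input simply uses a model pretrained on ImageNet.
2. For the other modalities, take the first convolutional layer of the pretrained RGB model, average its weights across the 3 RGB channels, and copy that average into every input channel of the temporal network's first layer.
3. Because the model being transferred was trained on RGB images, all modalities are linearly mapped into the 0-255 range.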
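
Step 2 of the list above is easy to misread, so here is a minimal sketch of the weight transformation on nested Python lists. The function name, the `[out][in][row][col]` layout, and the choice of 10 target channels in the demo are illustrative assumptions, not taken from the released code.

```python
def cross_modality_init(rgb_kernels, num_target_channels):
    """Average first-layer weights over the RGB channels and replicate the result.

    rgb_kernels[o][c][i][j] holds the pretrained weight of output filter o,
    input channel c (3 for RGB), at kernel position (i, j).
    Returns kernels with `num_target_channels` identical input channels.
    """
    new_kernels = []
    for kernel in rgb_kernels:
        num_rgb = len(kernel)
        height = len(kernel[0])
        width = len(kernel[0][0])
        # Mean over the input-channel axis, one value per kernel position.
        mean_plane = [
            [sum(kernel[c][i][j] for c in range(num_rgb)) / num_rgb
             for j in range(width)]
            for i in range(height)
        ]
        # Every new input channel starts from a copy of the same averaged plane.
        new_kernels.append(
            [[row[:] for row in mean_plane] for _ in range(num_target_channels)]
        )
    return new_kernels


if __name__ == "__main__":
    # One output filter, 3 RGB channels, a 1x2 kernel.
    rgb = [[[[1.0, 2.0]], [[3.0, 4.0]], [[5.0, 6.0]]]]
    flow = cross_modality_init(rgb, num_target_channels=10)
    print(len(flow[0]), flow[0][0])
```

Only the first layer needs this treatment, since the deeper layers do not depend on the number of input channels and can be copied from the RGB model unchanged.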
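Regularization Techniques
1. BN speeds up convergence but causes overfitting: BN uses batch statistics to transform activations toward a standard normal distribution, and with limited training data the estimate of the activation distribution is biased. So the paper freezes the mean and variance of every BN layer except the first one (partial BN). The first layer is left free because the activation distribution for optical flow certainly differs from that for RGB, so its mean and variance must differ too.
2. An extra dropout layer is added after the global pooling layer of BN-Inception to reduce overfitting.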
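
The partial BN idea can be shown with a toy, framework-free batch norm over scalar features. Everything here (`ToyBatchNorm`, `apply_partial_bn`, the momentum value) is a hypothetical illustration of the freezing rule, not the BN-Inception implementation.

```python
class ToyBatchNorm:
    """Scalar batch norm that tracks running statistics unless frozen."""

    def __init__(self, momentum=0.1, eps=1e-5):
        self.running_mean = 0.0
        self.running_var = 1.0
        self.momentum = momentum
        self.eps = eps
        self.frozen = False

    def forward(self, batch):
        if self.frozen:
            # Frozen layer: reuse the stored (pretrained) statistics as they are.
            mean, var = self.running_mean, self.running_var
        else:
            # Trainable layer: normalize with batch statistics and update the running ones.
            mean = sum(batch) / len(batch)
            var = sum((x - mean) ** 2 for x in batch) / len(batch)
            self.running_mean += self.momentum * (mean - self.running_mean)
            self.running_var += self.momentum * (var - self.running_var)
        return [(x - mean) / (var + self.eps) ** 0.5 for x in batch]


def apply_partial_bn(bn_layers):
    """Freeze the statistics of every BN layer except the first one."""
    for index, layer in enumerate(bn_layers):
        layer.frozen = index > 0


if __name__ == "__main__":
    layers = [ToyBatchNorm(), ToyBatchNorm()]
    apply_partial_bn(layers)
    for layer in layers:
        layer.forward([10.0, 12.0])
    print(layers[0].running_mean, layers[1].running_mean)
```

In a real framework the same effect usually comes from keeping all but the first BN layer in inference mode during fine-tuning.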
Data Augmentation: corner cropping and scale jittering
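
A rough sketch of how the two augmentations fit together, using the numbers I recall from the paper (inputs fixed at 256×340, crop width and height drawn from {256, 224, 192, 168}, crops later resized to 224×224). The helper names are hypothetical and the resize step itself is left out.

```python
import random

# Candidate crop widths/heights, as recalled from the paper (treat as an assumption).
CROP_SIZES = (256, 224, 192, 168)


def corner_crop_offsets(image_w, image_h, crop_w, crop_h):
    """Top-left (x, y) offsets of the four corner crops plus the center crop."""
    max_x = image_w - crop_w
    max_y = image_h - crop_h
    return [
        (0, 0),                    # top-left
        (max_x, 0),                # top-right
        (0, max_y),                # bottom-left
        (max_x, max_y),            # bottom-right
        (max_x // 2, max_y // 2),  # center
    ]


def sample_crop(image_w=340, image_h=256, rng=None):
    """Scale jittering (random crop size) combined with corner cropping (fixed positions)."""
    rng = rng or random
    crop_w = rng.choice(CROP_SIZES)
    crop_h = rng.choice(CROP_SIZES)
    x, y = rng.choice(corner_crop_offsets(image_w, image_h, crop_w, crop_h))
    return x, y, crop_w, crop_h  # the caller would then resize the crop to 224x224


if __name__ == "__main__":
    print(sample_crop())
```

Because width and height are drawn independently, the crop also varies in aspect ratio, not just in scale.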
Visualizations drawn with the DeepDraw tool

English Representations

Video-based action recognition has drawn a significant amount of attention
from the academic community [1-6]; CNNs have witnessed great success in classifying images of objects, scenes, and complex events [8-11]. Action recognition has been extensively studied in the past few years [2,18,24–26]. (a convenient way to cite a batch of recent papers)
In our view, the application of ConvNets in video-based action recognition
is impeded by two major obstacles. (a template for saying that, in our view, X faces two major obstacles)
To unleash the full potential of the temporal segment network framework (a phrase for unlocking a model's potential)
the temporal stream ConvNet takes a stack of consecutive optical flow fields as input.
We are also interested in exploring more input modalities to enhance the discriminative power of temporal segment networks.
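avoid implicitly focusing on the center area of an image. (keeps the model from implicitly attending to the center region; the 2014 Fei-Fei Li paper does implicitly attend to the center region)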

Advantages and Drawbacks

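Advantages:

Splits the video along time to remove redundancy; a snippet from each segment goes through the ConvNet, with weights shared across segments
Uses transfer learning to cut training time
Several different input variants; the features produced by the iDT algorithm help a lot
Freezes part of the parameters and trains only the rest (after observing why BN layers overfit: they force the activation distribution toward a standard normal)
Visualizes the model and finds that pretraining brings a big benefit: human motion stands out more and the learned features are better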

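Drawbacks:

I expected the video to be split after first identifying the action steps, but it is simply split evenly. The consequence is that a K set too large or too small is unfair across the actions in a dataset. If an action takes a very long time to complete but the video is cut too finely, the spatial dimension is not hurt much (unless the action interacts heavily with the environment; since warped optical flow concentrates the input on the moving foreground, the spatial impact is even smaller), but along the temporal dimension each network learns very little. So I think further improving the input features would certainly help, though as pioneering work the paper's value lies in its idea.
Transfer learning beats training from scratch on this task most likely because the datasets are not rich enough. Personally I find transferring an ImageNet-pretrained model to this task a bit of black magic, especially for the temporal connections, which probably do not transfer well. The authors do pull the non-RGB modalities toward RGB images, but a model already trained well on temporal data would be a much better starting point for transfer.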