Action Recognition (2): SlowFast

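1 The SlowFast model has two pathways: a Slow pathway and a Fast pathway. The Slow pathway samples frames at a low rate and uses many channels, mainly to capture spatial semantics. The Fast pathway samples frames at a high rate and uses few channels (mainly to reduce computation) to capture temporal (motion) features. Both pathways use a 3D ResNet as the backbone for feature extraction. The basic network structure is shown below: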
[Figure: basic SlowFast network architecture]
2 Neither pathway uses temporal downsampling. If the feature shape in the Slow pathway is $\{T,S^2,C\}$, the corresponding feature shape in the Fast pathway is $\{\alpha T,S^2,\beta C\}$. The Slow pathway uses non-degenerate temporal convolutions (temporal kernel size > 1), i.e. a $3\times1\times1$ kernel, only in res4 and res5, because using them in the two earlier residual stages was found to reduce accuracy. The paper's explanation: "We argue that this is because when objects move fast and the temporal stride is large, there is little correlation within a temporal receptive field unless the spatial receptive field is large enough (i.e., in later layers)." The Fast pathway uses non-degenerate temporal convolutions in every block, because this "pathway holds fine temporal resolution for the temporal convolutions to capture detailed motion".
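The rule for where temporal convolutions are non-degenerate can be written down as a small lookup. This is only a sketch: the function name `first_conv_kernel` and the stage names are hypothetical, and the temporal kernel size of 3 for the Fast pathway is an assumption (the text above only says its kernels are non-degenerate).

```python
SLOW_TEMPORAL_STAGES = ("res4", "res5")  # stages named in the text above


def first_conv_kernel(pathway: str, stage: str) -> tuple:
    """Return the (time, height, width) kernel of the temporal conv in a block.

    A temporal size of 1 means the convolution is degenerate, i.e. it acts
    like a 2D convolution applied to each frame independently.
    """
    if pathway == "fast":
        # Assumption: the Fast pathway uses temporal size 3 in every block.
        return (3, 1, 1)
    if pathway == "slow":
        t = 3 if stage in SLOW_TEMPORAL_STAGES else 1
        return (t, 1, 1)
    raise ValueError(f"unknown pathway: {pathway!r}")


if __name__ == "__main__":
    for stage in ("res2", "res3", "res4", "res5"):
        print(stage, first_conv_kernel("slow", stage), first_conv_kernel("fast", stage))
```

In this sketch only the Slow pathway depends on the stage, which mirrors the asymmetry described above.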
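3 Lateral connections fuse the features of the Fast pathway into the features of the Slow pathway. The paper describes three methods:
(i) Time-to-channel: reshape $\{\alpha T,S^2,\beta C\}$ directly into $\{T,S^2,\alpha \beta C\}$, i.e. pack every $\alpha$ frames into the channels of one frame
(ii) Time-strided sampling: a form of downsampling that keeps one frame out of every $\alpha$ frames, giving $\{T,S^2,\beta C\}$
(iii) Time-strided convolution: a 3D convolution with a $5\times1^2$ kernel, $2\beta C$ output channels, temporal stride $=\alpha$, and padding = 2
The two features are then fused by summation or concatenation. The experiments show that time-strided convolution combined with concatenation works best, so it is the default setting.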
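To check that all three lateral connections bring the Fast features back to $T$ frames, the shape arithmetic can be sketched in plain Python. The function names and the example sizes (T=4, S=56, C=256) are assumptions chosen for illustration; no real tensors are computed, only shapes in (time, height, width, channels) order.

```python
def conv_out_len(length: int, kernel: int, stride: int, padding: int) -> int:
    """Output length of a convolution along one axis (no dilation)."""
    return (length + 2 * padding - kernel) // stride + 1


def lateral_shape(method: str, T: int, S: int, fast_channels: int, alpha: int) -> tuple:
    """Shape after the lateral connection.

    The Fast feature going in has shape (alpha*T, S, S, fast_channels),
    where fast_channels corresponds to beta*C in the text.
    """
    if method == "time_to_channel":
        # Pack every alpha frames into the channel axis.
        return (T, S, S, alpha * fast_channels)
    if method == "time_strided_sampling":
        # Keep one frame out of every alpha frames.
        return (len(range(0, alpha * T, alpha)), S, S, fast_channels)
    if method == "time_strided_conv":
        # 5x1x1 kernel, temporal stride alpha, padding 2, 2*beta*C outputs.
        t = conv_out_len(alpha * T, kernel=5, stride=alpha, padding=2)
        return (t, S, S, 2 * fast_channels)
    raise ValueError(f"unknown method: {method!r}")


def fuse(slow_shape: tuple, lateral: tuple, mode: str = "concat") -> tuple:
    """Shape of the Slow feature after fusing in the lateral feature."""
    if slow_shape[:3] != lateral[:3]:
        raise ValueError("time and spatial sizes must match before fusion")
    if mode == "concat":
        return slow_shape[:3] + (slow_shape[3] + lateral[3],)
    if mode == "sum":
        if slow_shape[3] != lateral[3]:
            raise ValueError("element-wise sum needs equal channel counts")
        return slow_shape
    raise ValueError(f"unknown mode: {mode!r}")


if __name__ == "__main__":
    T, S, C, alpha = 4, 56, 256, 8   # illustrative sizes, not from the text
    fast_channels = C // 8           # beta = 1/8
    for m in ("time_to_channel", "time_strided_sampling", "time_strided_conv"):
        print(m, lateral_shape(m, T, S, fast_channels, alpha))
    lat = lateral_shape("time_strided_conv", T, S, fast_channels, alpha)
    print("concat:", fuse((T, S, S, C), lat))
```

With these sizes, the time-strided convolution maps 32 frames to (32 + 4 - 5) // 8 + 1 = 4 frames, so its output lines up with the Slow pathway and can be concatenated along the channel axis.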
4 For the hyperparameters $\alpha$ and $\beta$, the paper sets $\alpha=8$. For $\beta$, values from 1/32 to 1/4 are tried, and 1/8 is chosen in the end.
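As a rough illustration of what $\alpha=8$ and $\beta=1/8$ imply, the sketch below derives the frame indices and channel widths of the two pathways. Several things here are assumptions that do not come from the text above: the Slow pathway's temporal stride `tau=16`, the 64-frame raw clip, and the ResNet-50-style stage widths.

```python
from fractions import Fraction


def sample_indices(num_raw_frames: int, tau: int, alpha: int):
    """Frame indices seen by each pathway.

    tau is the Slow pathway's temporal stride (an assumed value here);
    the Fast pathway samples alpha times more densely, with stride tau/alpha.
    """
    if tau % alpha != 0:
        raise ValueError("tau must be divisible by alpha in this sketch")
    slow = list(range(0, num_raw_frames, tau))
    fast = list(range(0, num_raw_frames, tau // alpha))
    return slow, fast


def fast_widths(slow_widths, beta: Fraction):
    """Channel widths of the Fast pathway: beta times the Slow widths."""
    return [int(c * beta) for c in slow_widths]


if __name__ == "__main__":
    slow, fast = sample_indices(num_raw_frames=64, tau=16, alpha=8)
    print(len(slow), "slow frames:", slow)
    print(len(fast), "fast frames")
    # Assumed ResNet-50-style output widths for res2..res5.
    print(fast_widths([256, 512, 1024, 2048], Fraction(1, 8)))
```

Under these assumptions the Fast pathway sees 8 times as many frames as the Slow pathway but carries one eighth of the channels at every stage.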
5 The Fast pathway can also take weaker spatial inputs. The paper tries half spatial resolution, gray-scale, "time difference" frames, and optical flow as input, and finds that gray-scale input performs about as well as RGB input.
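Three of these weaker inputs are simple enough to sketch on tiny frames stored as nested lists. The details are assumptions rather than the paper's preprocessing: the gray-scale weights are the common ITU-R BT.601 luma coefficients, half resolution is done by plain subsampling, and the time difference is taken as the current frame minus the previous one. Optical flow is omitted because it needs a separate estimator.

```python
def to_gray(frame):
    """Convert a frame of (r, g, b) pixels to gray-scale.

    Assumption: BT.601 luma weights; the text does not specify a formula.
    """
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
            for row in frame]


def half_resolution(frame):
    """Halve height and width by keeping every second row and column."""
    return [row[::2] for row in frame[::2]]


def time_difference(frames):
    """Per-pixel difference between each frame and the previous one.

    Takes a list of single-channel frames and returns len(frames) - 1 frames.
    """
    return [
        [[c - p for c, p in zip(cur_row, prev_row)]
         for cur_row, prev_row in zip(cur, prev)]
        for prev, cur in zip(frames, frames[1:])
    ]


if __name__ == "__main__":
    rgb = [[(255, 0, 0), (0, 255, 0)], [(0, 0, 255), (255, 255, 255)]]
    print(to_gray(rgb))
    print(half_resolution([[1, 2, 3, 4], [5, 6, 7, 8]]))
    print(time_difference([[[1, 2]], [[4, 6]]]))
```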