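1 The SlowFast model has two pathways: a Slow pathway and a Fast pathway. The Slow pathway samples frames at a low rate and uses a large number of channels, mainly to capture spatial semantics. The Fast pathway samples frames at a high temporal rate and uses few channels (chiefly to reduce computation) to capture temporal features. Both pathways use a 3D ResNet as the backbone for feature extraction. The basic network diagram is shown below: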
[Figure: SlowFast network architecture]
2 Neither pathway uses temporal downsampling. If a feature in the Slow pathway has shape $\{T, S^2, C\}$, the corresponding feature in the Fast pathway has shape $\{\alpha T, S^2, \beta C\}$. In the Slow pathway, non-degenerate temporal convolutions (temporal kernel size > 1), i.e. a $3\times1\times1$ kernel, are used only in res4 and res5, because using them in the two earlier res stages was found to lower accuracy. The paper's explanation: "We argue that this is because when objects move fast and the temporal stride is large, there is little correlation within a temporal receptive field unless the spatial receptive field is large enough (i.e., in later layers)". In the Fast pathway, every block uses non-degenerate temporal convolutions, because the "pathway holds fine temporal resolution for the temporal convolutions to capture detailed motion".
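The shape bookkeeping above can be sketched in a few lines of Python. This is only an illustration, not the authors' code: `pathway_shapes` and `SLOW_TEMPORAL_KERNEL` are hypothetical names, and the values T=4, S=56, C=256 are assumed for the example.

```python
from fractions import Fraction

# Temporal kernel size per Slow-pathway stage, as described in the text:
# 1 (degenerate, effectively 2D) in res2/res3, 3 in res4/res5.
SLOW_TEMPORAL_KERNEL = {"res2": 1, "res3": 1, "res4": 3, "res5": 3}

def pathway_shapes(T, S, C, alpha, beta):
    """Return (slow, fast) feature shapes as (time, height, width, channels)."""
    fast_channels = Fraction(beta) * C
    if fast_channels.denominator != 1:
        raise ValueError("beta * C must be an integer channel count")
    slow = (T, S, S, C)
    fast = (alpha * T, S, S, int(fast_channels))
    return slow, fast

# Illustrative values only (T, S and C are assumptions).
slow, fast = pathway_shapes(T=4, S=56, C=256, alpha=8, beta=Fraction(1, 8))
print(slow)  # (4, 56, 56, 256)
print(fast)  # (32, 56, 56, 32)
```

The Fast feature has α times as many frames and β times as many channels, while the spatial size is the same in both pathways.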
3 Lateral connections fuse the Fast pathway's features into the Slow pathway's features. The paper proposes three ways to do this:
(i) Time-to-channel: reshape $\{\alpha T, S^2, \beta C\}$ directly into $\{T, S^2, \alpha\beta C\}$.
(ii) Time-strided sampling: a form of temporal downsampling that keeps one out of every $\alpha$ frames, giving $\{T, S^2, \beta C\}$.
(iii) Time-strided convolution: a 3D convolution with a $5\times1^2$ kernel, $2\beta C$ output channels, temporal stride $=\alpha$, and padding $=2$.
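The transformed Fast features are then fused with the Slow features by summation or concatenation. The experiments show that time-strided convolution combined with concatenation works best, so it is used as the default.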
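As a rough check that all three transforms bring the Fast features down to T frames, the sketch below computes the output shape of each option. It only does shape arithmetic, with no real tensors. The helper names are hypothetical, the example sizes (T=4, S=56, 32 Fast channels, 256 Slow channels) are assumptions, and the convolution case uses the standard output-length formula floor((L + 2p - k) / s) + 1.

```python
def conv_out_len(length, kernel, stride, padding):
    """Standard output-length formula for a strided convolution."""
    return (length + 2 * padding - kernel) // stride + 1

def lateral_shapes(T, S, alpha, fast_channels):
    """Shapes (time, height, width, channels) of the Fast features
    after each lateral transform. fast_channels stands for beta * C."""
    fast_T = alpha * T
    return {
        # (i) fold alpha consecutive frames into the channel axis
        "time_to_channel": (T, S, S, alpha * fast_channels),
        # (ii) keep one frame out of every alpha
        "time_strided_sampling": (len(range(0, fast_T, alpha)), S, S, fast_channels),
        # (iii) temporal kernel 5, stride alpha, padding 2, 2*beta*C outputs
        "time_strided_conv": (conv_out_len(fast_T, 5, alpha, 2), S, S, 2 * fast_channels),
    }

def fused_channels(slow_channels, lateral_channels, mode="concat"):
    """Channel count of the Slow feature after fusion."""
    if mode == "concat":
        return slow_channels + lateral_channels
    if mode == "sum":
        if slow_channels != lateral_channels:
            raise ValueError("summation needs matching channel counts")
        return slow_channels
    raise ValueError(f"unknown fusion mode: {mode}")

shapes = lateral_shapes(T=4, S=56, alpha=8, fast_channels=32)
for name, shape in shapes.items():
    print(name, shape)
print(fused_channels(256, shapes["time_strided_conv"][3]))  # 320
```

With kernel 5, padding 2 and stride α, an input of αT frames always yields exactly T output frames, which is why the result lines up with the Slow pathway. Summation additionally requires the two channel counts to match, whereas concatenation does not.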
4 For the hyperparameters $\alpha$ and $\beta$, the paper sets $\alpha = 8$ and tries values of $\beta$ from 1/32 to 1/4, settling on $\beta = 1/8$.
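To get a feel for what β controls, the sketch below sweeps a few values in the stated range. It rests on loud assumptions: the Slow width of 64 channels and the power-of-two sample points are chosen only for illustration, and the cost ratio α·β² is a back-of-the-envelope estimate that treats convolution cost as proportional to frames × input channels × output channels. It ignores kernel-size differences, the first layer, and the lateral connections.

```python
from fractions import Fraction

ALPHA = 8
SLOW_CHANNELS = 64  # assumed Slow-pathway width, for illustration only

def fast_relative_cost(alpha, beta):
    """Rough Fast/Slow cost ratio: alpha times more frames,
    beta times fewer input and output channels."""
    return alpha * Fraction(beta) ** 2

# Sample points inside the 1/32 .. 1/4 range (assumed, not the paper's exact grid).
for beta in (Fraction(1, 32), Fraction(1, 16), Fraction(1, 8), Fraction(1, 4)):
    fast_channels = int(beta * SLOW_CHANNELS)
    cost = fast_relative_cost(ALPHA, beta)
    print(f"beta={beta}: fast channels={fast_channels}, rough cost ratio={cost}")
```

Under this estimate the Fast pathway stays cheap despite seeing 8 times as many frames, because its cost falls quadratically with β.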
5 The Fast pathway is also tested with weaker spatial inputs: half spatial resolution, gray-scale, "time difference" frames, and optical flow. Gray-scale input turns out to perform about as well as RGB input.
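Three of these weaker inputs are simple enough to sketch on plain nested lists. This is an illustration, not the paper's preprocessing. The function names are hypothetical, the gray-scale weights are the common BT.601 luma coefficients (an assumption, since the conversion used is not stated here), and the half-resolution version is naive subsampling. Optical flow is left out because it needs a dedicated estimator.

```python
def half_resolution(frame):
    """Keep every second row and column (naive subsampling, no anti-aliasing)."""
    return [row[::2] for row in frame[::2]]

def to_gray(frame):
    """Convert an H x W frame of (R, G, B) tuples to one gray value per pixel.
    Weights are the BT.601 luma coefficients (assumed)."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
            for row in frame]

def time_difference(frames):
    """Subtract the previous frame from the current one, channel by channel."""
    diffs = []
    for prev, cur in zip(frames, frames[1:]):
        diffs.append([
            [tuple(c - p for c, p in zip(cur_px, prev_px))
             for cur_px, prev_px in zip(cur_row, prev_row)]
            for cur_row, prev_row in zip(cur, prev)
        ])
    return diffs

# Three tiny 2 x 2 RGB frames with made-up pixel values.
frames = [
    [[(0, 0, 0), (10, 10, 10)], [(20, 20, 20), (30, 30, 30)]],
    [[(5, 0, 0), (10, 15, 10)], [(20, 20, 25), (30, 30, 30)]],
    [[(5, 5, 5), (10, 15, 10)], [(20, 20, 25), (255, 255, 255)]],
]

print(half_resolution(frames[0]))   # [[(0, 0, 0)]]
print(time_difference(frames)[0])   # [[(5, 0, 0), (0, 5, 0)], [(0, 0, 5), (0, 0, 0)]]
print(len(to_gray(frames[2])))      # 2
```

Gray-scale input cuts the channel count from 3 to 1, and N frames produce N-1 time-difference frames.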