Faster R-CNN Explained

Faster R-CNN

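PyTorch implementation: https://github.com/jwyang/faster-rcnn.pytorch/tree/pytorch-1.0
In earlier detectors such as Fast R-CNN, the separate region proposal module is the speed bottleneck. Faster R-CNN instead generates region proposals directly with a CNN, the Region Proposal Network (RPN), which shares its convolutional layers with the second-stage detection network.
The overall framework can be viewed as Fast R-CNN + RPN, where the RPN generates the candidate boxes (region proposals).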

RPN

Anchor

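An anchor is simply a rectangular box of a predefined size, used to frame objects in the original image. It has two parameters, scale and ratio. The authors use 3 scales, {128, 256, 512}, and 3 ratios, {1:1, 1:2, 2:1}, which gives 9 anchors that roughly cover the different object sizes and shapes found in the original image.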
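The 9 anchor shapes can be enumerated with a few lines of Python. This is a minimal sketch, not the code of the linked repository: the function name `generate_anchors` is hypothetical, each anchor is assumed to have area scale², and the ratio is assumed to mean width:height.

```python
import math

def generate_anchors(scales=(128, 256, 512), ratios=(1.0, 0.5, 2.0)):
    """Return len(scales) * len(ratios) anchors as (x1, y1, x2, y2),
    all centered at the origin.

    Assumptions (not stated in the text): every anchor of a given scale
    has area scale**2, and ratio means width / height.
    """
    anchors = []
    for scale in scales:
        area = float(scale * scale)
        for ratio in ratios:
            w = math.sqrt(area * ratio)  # w * h == area and w / h == ratio
            h = w / ratio
            anchors.append((-w / 2, -h / 2, w / 2, h / 2))
    return anchors

anchors = generate_anchors()
print(len(anchors))  # 9
```

To place the anchors on the image, the same 9 boxes are shifted to the image position that corresponds to each feature map location.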
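These 9 anchors are only the initial detection boxes; the subsequent bounding box regression fine-tunes them. The following describes how the classification scores and box regressions are obtained from a given 3D feature map: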

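Suppose VGG is the backbone and the input image is $800\times 600$. VGG downsamples by a factor of 16, so the feature map has spatial size $ceil(800/16)\times ceil(600/16)=50\times 38$. An ordinary $3\times 3$ convolution first produces a new 3D feature map of size $50\times 38\times 256$ (the paper uses 256 channels for ZF and 512 for VGG; 256 is used in this walkthrough). Two parallel $1\times 1$ convolutions then produce outputs of size $50\times 38\times 2k$ and $50\times 38\times 4k$, where $k$ is the number of anchors per location. Note that both the classification layer (cls layer) and the regression layer (reg layer) operate on the 3D feature map, so the whole head is fully convolutional.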

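That is the procedure; now for the reasoning behind it. On a feature map of spatial size $50\times 38$, every pixel corresponds to $k$ predefined anchors, so $50\times 38\times k$ anchors are generated in total. In the two output feature maps of size $50\times 38\times 2k$ and $50\times 38\times 4k$, each pixel holds $2k$ and $4k$ numbers respectively: the foreground and background probabilities of the $k$ anchors, and the 4 position coordinates of each of the $k$ anchors.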
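The shape arithmetic above can be checked with a short sketch. The helper name `rpn_output_shapes` and its parameters are hypothetical; the defaults (stride 16, $k=9$, 256 intermediate channels) follow the numbers used in this walkthrough.

```python
import math

def rpn_output_shapes(img_w, img_h, stride=16, k=9, mid_channels=256):
    """Compute the RPN tensor shapes (width, height, channels) and the
    total anchor count for one image, following the text's conventions."""
    fw = math.ceil(img_w / stride)  # feature map width
    fh = math.ceil(img_h / stride)  # feature map height
    return {
        "intermediate": (fw, fh, mid_channels),  # after the 3x3 conv
        "cls": (fw, fh, 2 * k),                  # 2 scores per anchor
        "reg": (fw, fh, 4 * k),                  # 4 coordinates per anchor
        "num_anchors": fw * fh * k,
    }

shapes = rpn_output_shapes(800, 600)
print(shapes["cls"], shapes["reg"], shapes["num_anchors"])
# (50, 38, 18) (50, 38, 36) 17100
```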

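For classification, the paper labels an anchor as positive if it satisfies either of the following two rules:

The anchor has the highest IoU with some ground truth box.
The anchor has an IoU > 0.7 with any ground truth box.

As a result, one ground truth box may correspond to several positive anchors. An anchor whose IoU is < 0.3 with every ground truth box is labeled non-positive, i.e. background. All remaining anchors do not take part in training.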
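The labeling rules can be sketched in plain Python as follows. The function names `iou` and `label_anchors` are hypothetical, boxes are assumed to be `(x1, y1, x2, y2)` tuples, and the label values 1 / 0 / -1 (positive / background / ignored) are a convention chosen for this illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def label_anchors(anchors, gt_boxes, pos_thresh=0.7, neg_thresh=0.3):
    """Return one label per anchor: 1 positive, 0 background, -1 ignored."""
    ious = [[iou(a, g) for g in gt_boxes] for a in anchors]
    labels = [-1] * len(anchors)
    for i, row in enumerate(ious):
        best = max(row, default=0.0)
        if best < neg_thresh:
            labels[i] = 0  # IoU < 0.3 with every ground truth box
        if best > pos_thresh:
            labels[i] = 1  # rule 2: IoU > 0.7 with some ground truth box
    # Rule 1: for each ground truth box, the anchor(s) with the highest IoU
    # are positive, even when that IoU is below pos_thresh.
    for j in range(len(gt_boxes)):
        best = max((ious[i][j] for i in range(len(anchors))), default=0.0)
        if best > 0:
            for i in range(len(anchors)):
                if ious[i][j] == best:
                    labels[i] = 1
    return labels

anchors = [(0, 0, 10, 10), (0, 0, 10, 8), (20, 20, 30, 30), (0, 0, 10, 5)]
print(label_anchors(anchors, [(0, 0, 10, 10)]))  # [1, 1, 0, -1]
```

Rule 1 is applied last so that it overrides a background label. This guarantees every ground truth box gets at least one positive anchor, as long as some anchor overlaps it.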

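The RPN loss is likewise a multi-task loss with two parts, a classification loss and a regression loss:
$L(\{p_{i}\},\{t_{i}\})=\frac{1}{N_{cls}}\sum_{i}L_{cls}(p_{i},p_{i}')+\lambda\frac{1}{N_{reg}}\sum_{i}p_{i}'L_{reg}(t_{i},t_{i}')$
Here $i$ is the index of an anchor in a mini-batch and $p_{i}$ is the predicted probability that the anchor is an object. $p_i'$ is 1 when the anchor is positive and 0 otherwise, so the factor $p_i'$ activates the regression term only for positive anchors. $t_i$ holds the position coordinates of the predicted box and $t_i'$ those of the ground truth box. $N_{cls}$ is the mini-batch size (e.g. $N_{cls}$ = 256) and $N_{reg}$ is the number of anchor locations (e.g. $N_{reg}$ ~ 2400). In the authors' experiments, $\lambda$ = 10 gave the best results.
$L_{cls}(p_{i},p_{i}')=-\log[p_{i}p_{i}'+(1-p_{i})(1-p_{i}')]$
$L_{reg}(t_{i},t_{i}')=\mathrm{smooth}_{L1}(t_{i}-t_{i}')$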
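As a rough illustration, the loss can be evaluated on a handful of anchors in plain Python. The function names `smooth_l1` and `rpn_loss` are hypothetical. The sketch uses the standard smooth L1 definition ($0.5x^2$ for $|x|<1$, $|x|-0.5$ otherwise) and, as in the original paper, counts the regression term only for positive anchors ($p_i'=1$).

```python
import math

def smooth_l1(x):
    """Standard smooth L1: quadratic near zero, linear elsewhere."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5

def rpn_loss(probs, labels, t_pred, t_gt, n_cls=256, n_reg=2400, lam=10.0):
    """Multi-task RPN loss over the sampled anchors.

    probs  : predicted object probability p_i for each anchor (0 < p_i < 1)
    labels : p_i' (1 for a positive anchor, 0 for a negative one)
    t_pred, t_gt : 4-tuples of predicted / ground truth coordinates
    """
    cls_sum, reg_sum = 0.0, 0.0
    for p, y, t, tg in zip(probs, labels, t_pred, t_gt):
        cls_sum += -math.log(p * y + (1 - p) * (1 - y))
        if y == 1:  # regression contributes only for positive anchors
            reg_sum += sum(smooth_l1(a - b) for a, b in zip(t, tg))
    return cls_sum / n_cls + lam * reg_sum / n_reg

loss = rpn_loss(
    probs=[0.9, 0.2],
    labels=[1, 0],
    t_pred=[(0.1, 0.0, 0.2, 0.0), (0.0, 0.0, 0.0, 0.0)],
    t_gt=[(0.0, 0.0, 0.0, 0.0), (0.0, 0.0, 0.0, 0.0)],
)
print(loss)
```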

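Training uses alternating training:
(1) Train the RPN starting from the pretrained backbone, fine-tuning the backbone.
(2) Train Fast R-CNN on the proposals generated by the RPN from the previous step, again fine-tuning the backbone. At this point the two networks do not yet share convolutional layers.
(3) Fix the shared convolutional layers (the backbone) and train the RPN.
(4) Keep the shared convolutional layers fixed and train Fast R-CNN on the proposals generated by the RPN from the previous step.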

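The training procedure is therefore a kind of iteration, but the RPN / Fast R-CNN alternation is run for only two rounds. The reason, as the authors note, is: “A similar alternating training can be run for more iterations, but we have observed negligible improvements”, i.e. further rounds bring no meaningful gain.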