A Brief Overview of the R-CNN Series of Papers (DL)

1 R-CNN

The Region-based Convolutional Network method (R-CNN) achieves excellent object detection accuracy by using a deep ConvNet to classify object proposals.

1.1 Question

Can a CNN be used for object detection?

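1.2 Model

R-CNN extracts region proposals from the input image, warps each proposal to a fixed size, computes CNN features for it, and classifies those features with SVMs; bounding-box regressors then refine the box locations.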

1.3 Advantage

R-CNN combines region proposals with a CNN for object detection, which improves detection accuracy (mAP).

1.4 Weakness

1.4.1 Training is a multi-stage pipeline: the CNN, the SVMs, and the bounding-box regressors are trained in separate stages.

1.4.2 Features must be stored on disk, so training is expensive in both space and time.

1.4.3 The input image is warped to a fixed size, which may result in unwanted geometric distortion.

2 SPPnet

Spatial pyramid pooling networks (SPPnets) were proposed to speed up R-CNN by sharing computation. The SPPnet method computes a convolutional feature map for the entire input image and then classifies each object proposal using a feature vector extracted from the shared feature map. Features are extracted for a proposal by max pooling the portion of the feature map inside the proposal into a fixed-size output. Multiple output sizes are pooled and then concatenated as in spatial pyramid pooling.
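The pooling step described above can be sketched in a few lines. This is a minimal illustration, not the SPPnet implementation: it handles a single feature channel stored as a list of rows, and the function names `max_pool_bins` and `spp` as well as the pyramid levels `(1, 2, 4)` are assumptions chosen for the example.

```python
import math

def max_pool_bins(fmap, n):
    """Max-pool a 2-D map (list of rows) into an n x n grid, returned flat."""
    h, w = len(fmap), len(fmap[0])
    out = []
    for i in range(n):
        # Floor for the bin start and ceil for the bin end, so the bins
        # cover the whole map and are never empty.
        r0, r1 = (i * h) // n, math.ceil((i + 1) * h / n)
        for j in range(n):
            c0, c1 = (j * w) // n, math.ceil((j + 1) * w / n)
            out.append(max(fmap[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
    return out

def spp(fmap, levels=(1, 2, 4)):
    """Concatenate the pooled outputs of every pyramid level."""
    vec = []
    for n in levels:
        vec.extend(max_pool_bins(fmap, n))
    return vec

# Two maps of different sizes give vectors of the same length (1 + 4 + 16).
small = [[r * 4 + c for c in range(4)] for r in range(3)]
large = [[r * 7 + c for c in range(7)] for r in range(5)]
print(len(spp(small)), len(spp(large)))
```

The output length depends only on the chosen levels, not on the size of the map, which is what lets the fully connected layers that follow accept inputs of any size.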

2.1 Question

Can the network accept input images of arbitrary size instead of a fixed size? Can the region proposals share computation?

2.2 Model

2.3 Advantage

2.3.1 Training with variable-size images increases scale invariance and reduces over-fitting.

2.3.2 SPP generates a fixed-length output regardless of the input size.

2.3.3 SPP uses multi-level spatial bins, and multi-level pooling is robust to object deformations.

2.3.4 SPP can pool features extracted at variable scales thanks to the flexibility of input scales.

2.4 Weakness

2.4.1 Like R-CNN, training is a multi-stage pipeline, and features are also written to disk.

2.4.2 The fine-tuning algorithm proposed in the SPPnet paper cannot update the convolutional layers that precede the spatial pyramid pooling.

3 Fast R-CNN

For each object proposal, a region of interest (RoI) pooling layer extracts a fixed-length feature vector from the feature map. Each feature vector is fed into a sequence of fully connected (fc) layers that finally branch into two sibling output layers: one that produces softmax probability estimates over K object classes plus a catch-all “background” class and another layer that outputs four real-valued numbers for each of the K object classes.
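To make the shapes of the two sibling outputs concrete, here is a small sketch. It assumes a single linear layer per branch with random weights standing in for trained ones; `linear`, `softmax`, and `detection_heads` are hypothetical names, and the real network has fc layers before the branches.

```python
import math
import random

def linear(x, weights, bias):
    """Plain fully connected layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def softmax(z):
    """Numerically stable softmax."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def detection_heads(feat, K, rng):
    """Return (class probabilities over K+1 classes, K boxes of 4 numbers)."""
    d = len(feat)
    # Random weights are placeholders for trained parameters.
    w_cls = [[rng.gauss(0, 0.01) for _ in range(d)] for _ in range(K + 1)]
    w_box = [[rng.gauss(0, 0.001) for _ in range(d)] for _ in range(4 * K)]
    probs = softmax(linear(feat, w_cls, [0.0] * (K + 1)))
    flat = linear(feat, w_box, [0.0] * (4 * K))
    boxes = [flat[4 * k:4 * k + 4] for k in range(K)]
    return probs, boxes

probs, boxes = detection_heads([0.1, -0.2, 0.3], K=20, rng=random.Random(0))
print(len(probs), len(boxes), len(boxes[0]))
```

The classification branch includes the background class, so it has K + 1 outputs, while the regression branch has 4 outputs for each of the K object classes only.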

3.1 Question

Can a single-stage training algorithm that jointly learns to classify object proposals and refine their spatial locations improve accuracy and simplify the pipeline?

3.2 Model

RoI (region of interest) pooling
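RoI pooling can be viewed as spatial pyramid pooling with a single level, applied to the window of the feature map that a proposal covers. The sketch below is an illustration under simplifying assumptions: one feature channel, an RoI already expressed in feature-map cells as `(top, left, bottom, right)` with exclusive bottom and right edges, and a 2 x 2 output instead of the larger grids used in practice. The name `roi_pool` is chosen for the example.

```python
import math

def roi_pool(fmap, roi, out_h=2, out_w=2):
    """Max-pool the RoI window of fmap into an out_h x out_w grid."""
    top, left, bottom, right = roi
    h, w = bottom - top, right - left
    pooled = []
    for i in range(out_h):
        # Sub-window rows: floor for the start, ceil for the end.
        r0 = top + (i * h) // out_h
        r1 = top + math.ceil((i + 1) * h / out_h)
        row = []
        for j in range(out_w):
            c0 = left + (j * w) // out_w
            c1 = left + math.ceil((j + 1) * w / out_w)
            row.append(max(fmap[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
        pooled.append(row)
    return pooled

fmap = [[r * 6 + c for c in range(6)] for r in range(5)]
print(roi_pool(fmap, (1, 1, 5, 5)))  # 4 x 4 window
print(roi_pool(fmap, (0, 0, 3, 5)))  # 3 x 5 window, same output size
```

RoIs of different sizes all map to the same fixed output size, so every proposal yields a feature vector of the same length.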

Multi-task loss

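$$L(p, u, t^u, v) = L_{cls}(p, u) + \lambda \, [u \ge 1] \, L_{loc}(t^u, v)$$

in which $p$ is the predicted distribution over the $K + 1$ classes, $u$ is the true class, $t^u$ is the predicted box for class $u$, $v$ is the regression target, $L_{cls}(p, u) = -\log p_u$ is the log loss for the true class, and $[u \ge 1]$ is 1 for object classes and 0 for the background class $u = 0$. The localization loss is

$$L_{loc}(t^u, v) = \sum_{i \in \{x, y, w, h\}} \mathrm{smooth}_{L_1}(t^u_i - v_i), \qquad \mathrm{smooth}_{L_1}(x) = \begin{cases} 0.5 x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases}$$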
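The Fast R-CNN multi-task loss adds a log loss on the true class to a smooth L1 loss on the box coordinates, with the box term switched off for background RoIs. A minimal sketch for a single RoI follows; the function names and the default `lam=1.0` weighting are assumptions for the example.

```python
import math

def smooth_l1(x):
    """Quadratic near zero, linear elsewhere (less sensitive to outliers)."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1 else ax - 0.5

def multitask_loss(p, u, t_u, v, lam=1.0):
    """p: class probabilities (index 0 = background), u: true class,
    t_u: predicted box for class u, v: regression target."""
    l_cls = -math.log(p[u])
    # Background RoIs (u == 0) have no ground-truth box, so no box loss.
    l_loc = sum(smooth_l1(t - g) for t, g in zip(t_u, v)) if u >= 1 else 0.0
    return l_cls + lam * l_loc

print(multitask_loss([0.5, 0.5], 1, [0.0, 0.0, 0.0, 0.0], [0.5, 0.0, 0.0, 2.0]))
```

Because both terms are computed from the same forward pass, classification and box refinement can be trained together in one stage.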

3.3 Advantage

The model is trained in a single stage that jointly learns to classify object proposals and refine their spatial locations. It achieves higher mAP than R-CNN and SPPnet and is faster in both training and testing.

3.4 Weakness

The region proposals (RoIs) still have to be generated by a separate method, which is time-consuming.

4 Faster R-CNN

Faster R-CNN introduces a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals. It further merges the RPN and Fast R-CNN into a single network by sharing their convolutional features. Using the recently popular terminology of neural networks with “attention” mechanisms, the RPN component tells the unified network where to look.

4.1 Question

Can region proposals and CNN features be computed together by one shared CNN?

4.2 Model

The reg layer has 4k outputs encoding the coordinates of k boxes, and the cls layer outputs 2k scores that estimate probability of object or not object for each proposal.

Anchor: the k proposals are parameterized relative to k reference boxes, which we call anchors. An anchor is centered at the sliding window in question, and is associated with a scale and an aspect ratio. By default the model uses 3 scales and 3 aspect ratios, yielding k = 9 anchors at each sliding position. For a convolutional feature map of size W × H (typically about 2,400 positions), there are WHk anchors in total.
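The anchor count can be checked with a short sketch. The stride of 16 pixels, the scales (128, 256, 512), the ratios (0.5, 1, 2), and the 60 × 40 feature map are assumptions chosen for illustration; `make_anchors` is a hypothetical helper, not code from the paper.

```python
def make_anchors(feat_w, feat_h, stride=16,
                 scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (x1, y1, x2, y2) anchors in image pixels, k per position."""
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            # Center of this sliding-window position in image coordinates.
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride
            for s in scales:
                for r in ratios:
                    # r is height / width; the area stays close to s * s.
                    w = s / r ** 0.5
                    h = s * r ** 0.5
                    anchors.append((cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2))
    return anchors

anchors = make_anchors(60, 40)   # 60 * 40 = 2,400 positions
k = 9
print(len(anchors))              # W * H * k anchors
print(4 * k, 2 * k)              # reg and cls outputs per position
```

With 2,400 positions and k = 9, this gives 21,600 anchors, and at each position the reg layer produces 36 numbers and the cls layer 18.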

4.3 Advantage

4.3.1 By sharing convolutional features with the downstream detection network, the region proposal step is nearly cost-free.

4.3.2 The method enables a unified, deep-learning-based object detection system to run at near real-time frame rates.

4.3.3 The learned RPN also improves region proposal quality and thus the overall object detection accuracy.

4.4 Weakness

It needs alternating training: the RPN and the detection network cannot be trained simultaneously as a single multi-task network.

5 Open questions

I still need to learn how selective search works.
In SPPnet, the fine-tuning algorithm cannot update the convolutional layers that precede the spatial pyramid pooling, but in Fast R-CNN, fine-tuning is end-to-end. Why does this happen? Under what circumstances can the final pooling layer propagate errors backward?
What would the results be if Faster R-CNN used multi-level RoI pooling?

