1. Paper Overview

This is a 2015 work that proposes deep image classification networks (up to 19 layers); VGG16 later became the most commonly used variant. The experiments in the paper show that the deeper the CNN, the stronger its feature extraction ability. The operation that makes the network deeper is the use of small (3×3) convolution kernels, which reduce the parameter count without shrinking the receptive field, allowing more layers to be stacked. The authors only go up to 19 layers and do not push further; one can guess that going deeper would run into vanishing gradients, causing classification performance to drop rather than improve — exactly the problem that ResNet later solved.

In this paper, we address another important aspect of ConvNet architecture
design – its depth. To this end, we fix other parameters of the architecture, and steadily increase the
depth of the network by adding more convolutional layers, which is feasible due to the use of very
small (3 × 3) convolution filters in all layers.
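The receptive-field arithmetic behind stacking small filters can be sketched as follows (a minimal illustration, not code from the paper): each stride-1 k×k convolution grows the effective receptive field by k−1.

```python
# Effective receptive field of a stack of stride-1 convolutions.
# Each k x k conv (stride 1) grows the receptive field by k - 1,
# so n stacked 3x3 convs cover a (1 + 2n) x (1 + 2n) input region.
def stacked_receptive_field(kernel_size: int, num_layers: int) -> int:
    rf = 1
    for _ in range(num_layers):
        rf += kernel_size - 1
    return rf

# Three 3x3 convs see the same 7x7 region as a single 7x7 conv:
assert stacked_receptive_field(3, 3) == 7
# Two 3x3 convs match a single 5x5 conv:
assert stacked_receptive_field(3, 2) == 5
```

This is why depth can be increased "without affecting the receptive field", as noted above.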

In Table 2 we report the number of parameters for each configuration. In spite of a large depth, the
number of weights in our nets is not greater than the number of weights in a more shallow net with
larger conv. layer widths and receptive fields (144M weights in (Sermanet et al., 2014)).

Paper reading: VERY DEEP CONVOLUTIONAL NETWORKS FOR LARGE-SCALE IMAGE RECOGNITION (VGG)
In the paper, the authors point out that although LRN (Local Response Normalisation) contributed to the final result in AlexNet, it brings no benefit in the VGG networks while increasing memory consumption and computation, so it is not used in the deeper configurations.

2. Two Benefits of 3×3 Convolution Kernels

So what have we gained by using, for instance, a stack of three 3×3 conv. layers instead of a single 7×7 layer?
First, we incorporate three non-linear
rectification layers instead of a single one, which makes the decision function more discriminative.
Second, we decrease the number of parameters: assuming that both the input and the output of a three-layer 3 × 3 convolution stack has C channels, the stack is parametrised by 3(3²C²) = 27C² weights; at the same time, a single 7 × 7 conv. layer would require 7²C² = 49C² parameters, i.e. 81% more.
This can be seen as imposing a regularisation on the 7 × 7 conv. filters, forcing them to have a decomposition through the 3 × 3 filters (with non-linearity injected in between). In other words, this amounts to introducing a form of regularisation.
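The parameter count comparison above can be checked with a few lines (bias terms ignored, as in the paper's figures):

```python
# Weight count for num_layers stacked k x k convs, each with
# C input channels and C output channels (biases ignored).
def conv_params(kernel_size: int, channels: int, num_layers: int) -> int:
    return num_layers * kernel_size * kernel_size * channels * channels

C = 512  # e.g. the widest VGG stage
stack_3x3 = conv_params(3, C, 3)   # 3 * (3^2 C^2) = 27 C^2
single_7x7 = conv_params(7, C, 1)  # 7^2 C^2 = 49 C^2
assert stack_3x3 == 27 * C * C
assert single_7x7 == 49 * C * C
# The single 7x7 layer needs ~81% more weights:
assert round(100 * (single_7x7 - stack_3x3) / stack_3x3) == 81
```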

3. The Initialisation of the Network Weights

Train the shallow network first, then the deeper ones: the shallow net is trained from random initialisation, and the deeper nets are initialised with the shallow net's pre-trained weights.

The initialisation of the network weights is important, since bad initialisation can stall learning due
to the instability of gradient in deep nets. To circumvent this problem, we began with training
the configuration A (Table 1), shallow enough to be trained with random initialisation. Then, when
training deeper architectures, we initialised the first four convolutional layers and the last three fully-connected layers with the layers of net A (the intermediate layers were initialised randomly). We did
not decrease the learning rate for the pre-initialised layers, allowing them to change during learning.
For random initialisation (where applicable), we sampled the weights from a normal distribution
with zero mean and 10⁻² variance. The biases were initialised with zero. It is worth noting that
after the paper submission we found that it is possible to initialise the weights without pre-training
by using the random initialisation procedure of Glorot & Bengio (2010).
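A minimal sketch of the random initialisation described above — weights from N(0, 10⁻²) (i.e. standard deviation 0.1), biases zero. The layer shape is illustrative, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def init_layer(shape):
    """Random init per the paper: weights ~ N(0, 1e-2), biases = 0."""
    weights = rng.normal(loc=0.0, scale=np.sqrt(1e-2), size=shape)
    biases = np.zeros(shape[0])  # one bias per output filter/unit
    return weights, biases

# e.g. 64 filters of size 3x3 over 3 input channels:
w, b = init_layer((64, 3, 3, 3))
assert w.shape == (64, 3, 3, 3) and np.all(b == 0)
```

Note that variance 10⁻² means the standard deviation is √(10⁻²) = 0.1, which is why `scale=np.sqrt(1e-2)` is used.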

4. Two Methods for Handling Non-Fixed Image Sizes at Test Time

One method is multi-crop: crop the test image into patches of the training size. The other converts the fully-connected layers at the end of the classification network into convolutional layers, yielding a class score map that is then spatially average-pooled (Q denotes the test-time scale).

At test time, given a trained ConvNet and an input image, it is classified in the following way. First,
it is isotropically rescaled to a pre-defined smallest image side, denoted as Q (we also refer to it
as the test scale).
We note that Q is not necessarily equal to the training scale S (as we will show
in Sect. 4, using several values of Q for each S leads to improved performance). Then, the network
is applied densely over the rescaled test image in a way similar to (Sermanet et al., 2014). Namely,
the fully-connected layers are first converted to convolutional layers (the first FC layer to a 7 × 7
conv. layer, the last two FC layers to 1 × 1 conv. layers). The resulting fully-convolutional net is
then applied to the whole (uncropped) image. The result is a class score map with the number of
channels equal to the number of classes, and a variable spatial resolution, dependent on the input
image size. Finally, to obtain a fixed-size vector of class scores for the image, the class score map is
spatially averaged (sum-pooled). We also augment the test set by horizontal flipping of the images;
the soft-max class posteriors of the original and flipped images are averaged to obtain the final scores
for the image.
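The dense (fully-convolutional) evaluation above can be summarised by a small size calculation, assuming the VGG-16 layout (five 2×2 max-pools, then the first FC layer reinterpreted as a 7×7 valid convolution, then two 1×1 convs):

```python
# Spatial side length of the class score map when the converted,
# fully-convolutional net is applied to an image of smallest side q.
# Five 2x2 max-pools divide the resolution by 32; the 7x7 "FC"
# convolution (valid padding) then shrinks it by 6; the two 1x1
# convs leave it unchanged.
def score_map_side(q: int) -> int:
    feat = q // 32          # after the five pooling stages
    return feat - 7 + 1     # after the 7x7 convolution

assert score_map_side(224) == 1   # training size: a single score vector
assert score_map_side(384) == 6   # larger test image: a 6x6 score map
```

This is why the score map has "a variable spatial resolution, dependent on the input image size", and why spatial average-pooling is needed to get a fixed-size score vector.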

The paper compares the two evaluation methods experimentally; see its results table.

5. Correspondence Between Training and Test Scales

(1) Single-scale evaluation

We begin with evaluating the performance of individual ConvNet models at a single scale with the
layer configurations described in Sect. 2.2. The test image size was set as follows: Q = S for fixed
S, and Q = 0.5(Smin + Smax) for jittered S ∈ [Smin, Smax]. The results are shown in Table 3.

Q = S applies when training at a single fixed scale: S is the training scale and Q the test scale.
Q = 0.5(Smin + Smax) applies when scale jittering is used during training: the input image is first rescaled so that its smallest side equals a scale S sampled from [Smin, Smax], then randomly cropped to 224×224.
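A rough sketch of the scale-jittered preprocessing described above, with image handling mocked as plain (height, width) tuples; the [256, 512] range matches the paper's jittering interval, but the helper names are illustrative:

```python
import random

S_MIN, S_MAX = 256, 512  # jittering range for the training scale S
CROP = 224               # fixed training crop size

def jittered_crop_size(height: int, width: int, rng: random.Random):
    """Sample S, isotropically rescale the smallest side to S, crop 224x224."""
    s = rng.randint(S_MIN, S_MAX)           # sampled training scale S
    scale = s / min(height, width)          # isotropic rescale factor
    new_h, new_w = round(height * scale), round(width * scale)
    return s, (new_h, new_w), (CROP, CROP)  # crop taken from the rescaled image

s, rescaled, crop = jittered_crop_size(480, 640, random.Random(0))
assert min(rescaled) == s and crop == (224, 224)
```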


(2) Multi-scale evaluation

Considering that a large discrepancy between training and testing scales leads to a drop in performance, the models
trained with fixed S were evaluated over three test image sizes, close to the training one: Q = {S - 32, S, S + 32}. At the same time, scale jittering at training time allows the network to be
applied to a wider range of scales at test time, so the model trained with variable S ∈ [Smin; Smax]
was evaluated over a larger range of sizes Q = {Smin, 0.5(Smin + Smax), Smax} (i.e. evaluation at three scales).
