VGGNet Paper Notes

Title: Very Deep Convolutional Networks for Large-Scale Image Recognition (2014)

Link: paper

Contents

abstract
1 Introduction
2 Convnet Configurations

2.1 Architecture
2.2 Configurations
2.3 Discussion

3 Classification Framework

3.1 Training
3.2 Testing
3.3 Implementation Details

4 Classification Experiments

4.1 Single scale evaluation
4.2 Multi-scale evaluation
4.3 Multi-crop evaluation
4.4 ConvNet fusion
4.5 Comparison with the state of the art

5 Conclusion

abstract

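Task: large-scale image recognition

Main contribution: by comparing networks of different depths, the paper evaluates how the depth of a convolutional network (with very small 3 × 3 convolution filters) affects its accuracy.

The paper is based on the team's ImageNet Challenge 2014 submission, which took first place in the localisation track and second place in the classification track.

The team is called VGG, hence the model name VGGNet.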

1 Introduction

Convolutional networks (ConvNets) have recently enjoyed a great success in large-scale image and video recognition

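CNNs have achieved great success in large-scale image and video recognition, most visibly in the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC).

Earlier work also modified the CNN architecture in pursuit of higher accuracy.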

In this paper, we address another important aspect of ConvNet architecture design – its depth.

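This paper focuses on network depth. With the other architecture parameters fixed, depth is increased by adding more convolutional layers, which is feasible because all the convolution filters are very small (3 × 3).

The result is a more accurate ConvNet architecture that not only achieves state-of-the-art accuracy on the ILSVRC classification and localisation tasks but also performs well on other image recognition datasets.

The paper is organised as follows:

Section 2: network configurations
Section 3: details of classification training and evaluation
Section 4: comparison of the configurations on the ILSVRC classification task
Section 5: conclusion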

2 Convnet Configurations

2.1 Architecture

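convolution layers
input: 224 × 224 RGB image
filter size: 3 × 3 (receptive field)
stride: 1 pixel
pooling layers: five max-pooling layers, which follow only some of the conv layers (2 × 2 pixel window, stride 2)
Fully-Connected (FC) layers: identical in all networks
first two layers: 4096 channels each
third layer: 1000 channels (one per class)
final layer: soft-max
hidden layers
All hidden layers use the ReLU activation function.
All networks except one omit Local Response Normalisation (LRN): this normalisation does not improve performance on the ILSVRC dataset, and it costs extra memory and computation time.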

2.2 Configurations

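The paper's approach is to compare CNNs of different depths. It designs the following networks (Table 1 in the paper, one configuration per column), named A to E.

All configurations follow the design in 2.1 and differ only in depth: from 11 weight layers in A (8 conv. and 3 FC layers) to 19 weight layers in E (16 conv. and 3 FC layers).

The number of channels in the conv layers is fairly small: it starts at 64 and doubles after each max-pooling layer until it reaches 512.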
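
As a rough check of the description above, the following Python sketch traces the channel width and feature-map size through the five conv blocks and counts the weight layers of nets A and E. The per-block conv counts are taken from Table 1 of the paper, and size-preserving padding for the 3 × 3 convs is assumed; all function and variable names are made up for illustration.

```python
# Number of 3x3 conv layers in each of the five blocks (from Table 1 of the paper).
CONVS_PER_BLOCK = {
    "A": [1, 1, 2, 2, 2],
    "E": [2, 2, 4, 4, 4],
}
NUM_FC_LAYERS = 3


def block_widths(first=64, cap=512, blocks=5):
    """Channel width of each block: start at `first`, double after each pool, stop at `cap`."""
    widths, width = [], first
    for _ in range(blocks):
        widths.append(width)
        width = min(width * 2, cap)
    return widths


def trace_sizes(input_size=224, blocks=5):
    """Feature-map side after each block's 2x2, stride-2 max-pool.

    Assumes the 3x3 convs are padded so that they keep the spatial size.
    """
    sizes, size = [], input_size
    for _ in range(blocks):
        size //= 2
        sizes.append(size)
    return sizes


def weight_layers(config):
    """Total number of weight layers (conv + FC) in a configuration."""
    return sum(CONVS_PER_BLOCK[config]) + NUM_FC_LAYERS


if __name__ == "__main__":
    print(block_widths())   # channel width of each block
    print(trace_sizes())    # feature-map side after each pooling layer
    print(weight_layers("A"), weight_layers("E"))
```

Under these assumptions the last block outputs a 7 × 7 map with 512 channels, which is what the first FC layer consumes.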


2.3 Discussion

small receptive field:
Compared with the top-performing entries of the ILSVRC-2012 and ILSVRC-2013 competitions, the networks in this paper use very small 3 × 3 receptive fields with stride 1 throughout.
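
The paper's discussion motivates the small filters by noting that a stack of 3 × 3 layers covers the same area as one larger filter while using fewer weights. A minimal sketch of that arithmetic, assuming stride 1, no pooling between the stacked layers, and the same channel count in and out (the function names are illustrative only):

```python
def stacked_receptive_field(kernel=3, layers=1):
    """Effective receptive field of `layers` stacked stride-1 convs with a square kernel."""
    field = 1
    for _ in range(layers):
        field += kernel - 1  # each stride-1 layer widens the field by kernel - 1
    return field


def conv_weights(kernel, channels, layers=1):
    """Weight count (biases ignored) of `layers` conv layers with `channels` in and out."""
    return layers * kernel * kernel * channels * channels


if __name__ == "__main__":
    channels = 64
    print(stacked_receptive_field(3, layers=2))   # same area as one 5x5 filter
    print(stacked_receptive_field(3, layers=3))   # same area as one 7x7 filter
    print(conv_weights(3, channels, layers=3), conv_weights(7, channels))
```

With C channels, three 3 × 3 layers need 27C² weights against 49C² for a single 7 × 7 layer, and they also apply three non-linearities instead of one.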

3 Classification Framework

3.1 Training

the training is carried out by optimising the multinomial logistic regression objective using mini-batch gradient descent with momentum.

learning strategy

SGD
batch size = 256
momentum = 0.9
weight decay = 0.0005
dropout ratio = 0.5
The learning rate was initially set to 0.01, and then decreased by a factor of 10 when the validation set accuracy stopped improving.
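
The settings above can be sketched in plain Python as follows. This is only an illustration under assumptions: weight decay is folded into the gradient as an L2 term (one common formulation, not necessarily the authors' exact update), and `patience` is a hypothetical parameter, since the notes do not say how "stopped improving" was detected.

```python
def sgd_momentum_step(weight, grad, velocity, lr, momentum=0.9, weight_decay=5e-4):
    """One update for a single scalar weight; returns (new_weight, new_velocity)."""
    # L2 weight decay is added to the gradient before the momentum update.
    velocity = momentum * velocity - lr * (grad + weight_decay * weight)
    return weight + velocity, velocity


class PlateauSchedule:
    """Divide the learning rate by 10 when validation accuracy stops improving.

    `patience` (evaluations to wait without improvement) is a hypothetical knob.
    """

    def __init__(self, lr=0.01, factor=0.1, patience=1):
        self.lr, self.factor, self.patience = lr, factor, patience
        self.best = float("-inf")
        self.stale = 0

    def update(self, val_accuracy):
        """Record one validation result and return the learning rate to use next."""
        if val_accuracy > self.best:
            self.best, self.stale = val_accuracy, 0
        else:
            self.stale += 1
            if self.stale >= self.patience:
                self.lr *= self.factor
                self.stale = 0
        return self.lr
```

In a training loop, `sgd_momentum_step` would be applied to every weight for each mini-batch of 256 images, and `PlateauSchedule.update` would be called after each validation pass.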

initialisation

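Following the configurations in Table 1, the shallow network (A) can be trained from random initialisation. When training the deeper networks, the first four convolutional layers and the last three fully-connected layers are initialised with the weights of net A, and the remaining intermediate layers are initialised randomly.

Random initialisation samples the weights from a normal distribution with mean 0 and standard deviation 0.01.

The biases are initialised to 0.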
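
A minimal sketch of the random initialisation described above, using only the standard library. The layer is represented as a plain nested list, and `init_layer`, `fan_out`, and `fan_in` are illustrative names, not taken from the paper. Copying weights from net A is left out because the notes do not spell out how layers are matched.

```python
import random


def init_layer(fan_out, fan_in, std=0.01, rng=None):
    """Return (weights, biases): weights drawn from N(0, std^2), biases all zero."""
    rng = rng or random.Random()
    weights = [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]
    biases = [0.0] * fan_out
    return weights, biases


if __name__ == "__main__":
    weights, biases = init_layer(4, 3, rng=random.Random(0))
    print(weights)
    print(biases)
```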

training image size

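Let S be the smallest side of the rescaled training image, from which the network input is cropped. The crop size is fixed at 224 × 224. When S = 224 the crop spans the whole smallest side of the image; when S >> 224 the crop covers only a small part of the image.

There are two ways to set the training scale S.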

The first is to fix S, which corresponds to single-scale training. The second approach to setting S is multi-scale training.

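The first approach fixes S, which corresponds to single-scale training. In the second, S is sampled at random from a fixed range for each training image, which corresponds to multi-scale training and can be seen as a form of data augmentation.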
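
The two ways of choosing S, followed by the fixed-size crop, can be sketched as below. The default range [256, 512] is the one reported in the paper and is not stated in these notes. The helper names and the use of integer pixel sizes are assumptions made for illustration.

```python
import random


def sample_scale(rng, fixed=None, s_min=256, s_max=512):
    """Training scale S: `fixed` for single-scale, else uniform in [s_min, s_max]."""
    return fixed if fixed is not None else rng.randint(s_min, s_max)


def rescale_size(width, height, s):
    """Isotropically rescaled (width, height) whose smallest side equals s."""
    ratio = s / min(width, height)
    # max(...) guards against rounding pushing a side below s.
    return max(s, round(width * ratio)), max(s, round(height * ratio))


def random_crop_box(rng, width, height, crop=224):
    """(left, top, right, bottom) of a random crop x crop window inside the image."""
    left = rng.randint(0, width - crop)
    top = rng.randint(0, height - crop)
    return left, top, left + crop, top + crop


if __name__ == "__main__":
    rng = random.Random(0)
    s = sample_scale(rng)                      # multi-scale: S differs per image
    width, height = rescale_size(500, 375, s)  # a 500 x 375 source image
    print(s, (width, height), random_crop_box(rng, width, height))
```

Passing `fixed=256` or `fixed=384` to `sample_scale` gives single-scale training; omitting it gives the scale-jittered variant.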

3.2 Testing

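Given a trained network and an input image, classification proceeds as follows:

First, the image is rescaled so that its smallest side equals a pre-defined size, denoted Q;
Next, the network is applied densely over the rescaled test image. The fully-connected layers are converted to convolutional layers (the first to a 7 × 7 conv layer, the last two to 1 × 1 conv layers), so the whole network becomes a fully-convolutional net;
Finally, the resulting class score map, whose number of channels equals the number of classes, is spatially averaged to give a fixed-size vector of class scores.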
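
A rough sketch of the last step, under stated assumptions: the convs preserve spatial size, there are five 2 × 2 stride-2 pools, and the first FC layer is applied as an unpadded 7 × 7 convolution (as in the paper). The score map is a nested list indexed `[class][row][col]`, and the function names are illustrative.

```python
def score_map_side(q, pools=5, first_fc_kernel=7):
    """Side of the class score map for a square test image of side q."""
    side = q
    for _ in range(pools):
        side //= 2                      # each 2x2, stride-2 pool halves the side
    return side - first_fc_kernel + 1   # unpadded 7x7 conv shrinks it by 6


def spatial_average(score_map):
    """Average a [classes][rows][cols] score map over its spatial positions."""
    scores = []
    for channel in score_map:
        values = [v for row in channel for v in row]
        scores.append(sum(values) / len(values))
    return scores


if __name__ == "__main__":
    print(score_map_side(224), score_map_side(384))
    print(spatial_average([[[1, 3], [5, 7]], [[0, 0], [0, 4]]]))
```

For Q = 224 the score map is a single position, so averaging changes nothing; for larger Q the map grows, and the average turns it back into one score per class.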

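Test images can also be augmented by horizontal flipping: the soft-max class posteriors of the original and flipped images are averaged to give the final prediction.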
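
Flip averaging at test time might look like the sketch below. `model` is a hypothetical callable that maps an image (here a list of pixel rows) to raw class scores; it stands in for the trained network and is not part of the paper.

```python
import math


def softmax(scores):
    """Numerically stable soft-max over a list of class scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def hflip(image):
    """Mirror an image stored as a list of pixel rows."""
    return [row[::-1] for row in image]


def predict_with_flip(model, image):
    """Average the soft-max posteriors of the image and its horizontal mirror."""
    original = softmax(model(image))
    mirrored = softmax(model(hflip(image)))
    return [(a + b) / 2 for a, b in zip(original, mirrored)]
```

The averaged vector still sums to 1, so it can be used directly as the final class posterior.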