论文原文:LINK
论文年份:2014
论文被引:42645(20/08/2020)



VERY DEEP CONVOLUTIONAL NETWORKS FOR LARGE-SCALE IMAGE RECOGNITION

Abstract

In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3×3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16–19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively. We also show that our representations generalise well to other datasets, where they achieve state-of-the-art results. We have made our two best-performing ConvNet models publicly available to facilitate further research on the use of deep visual representations in computer vision.

在这项工作中,我们研究了卷积网络深度对其在大规模图像识别任务中准确率的影响。我们的主要贡献是使用具有非常小的(3×3)卷积滤波器的架构,对深度不断增加的网络进行了全面评估,结果表明,将深度推进到16–19个权重层可以显著改进现有技术的配置。这些发现是我们参加2014年ImageNet挑战赛的基础,我们的团队分别在定位和分类任务上获得了第一名和第二名。我们还表明,我们的表示可以很好地推广到其它数据集,并在这些数据集上取得最优结果。我们已公开两个性能最佳的ConvNet模型,以促进深度视觉表示在计算机视觉中应用的进一步研究。


1 INTRODUCTION

Convolutional networks (ConvNets) have recently enjoyed a great success in large-scale image and video recognition (Krizhevsky et al., 2012; Zeiler & Fergus, 2013; Sermanet et al., 2014; Simonyan & Zisserman, 2014) which has become possible due to the large public image repositories, such as ImageNet (Deng et al., 2009), and high-performance computing systems, such as GPUs or large-scale distributed clusters (Dean et al., 2012). In particular, an important role in the advance of deep visual recognition architectures has been played by the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) (Russakovsky et al., 2014), which has served as a testbed for a few generations of large-scale image classification systems, from high-dimensional shallow feature encodings (Perronnin et al., 2010) (the winner of ILSVRC-2011) to deep ConvNets (Krizhevsky et al., 2012) (the winner of ILSVRC-2012).

卷积网络(ConvNets)最近在大规模图像和视频识别方面取得了巨大成功(Krizhevsky等,2012;Zeiler & Fergus,2013;Sermanet等,2014;Simonyan & Zisserman,2014),这得益于大型公共图像库(如ImageNet(Deng等,2009))以及高性能计算系统(如GPU或大规模分布式集群(Dean等,2012))。尤其是,ImageNet大规模视觉识别挑战赛(ILSVRC)(Russakovsky等,2014)在深度视觉识别架构的发展中发挥了重要作用,它是几代大规模图像分类系统的试验平台:从高维浅层特征编码(Perronnin等,2010)(ILSVRC-2011的冠军)到深度ConvNets(Krizhevsky等,2012)(ILSVRC-2012的冠军)。

With ConvNets becoming more of a commodity in the computer vision field, a number of attempts have been made to improve the original architecture of Krizhevsky et al. (2012) in a bid to achieve better accuracy. For instance, the best-performing submissions to the ILSVRC2013 (Zeiler & Fergus, 2013; Sermanet et al., 2014) utilised smaller receptive window size and smaller stride of the first convolutional layer. Another line of improvements dealt with training and testing the networks densely over the whole image and over multiple scales (Sermanet et al., 2014; Howard, 2014). In this paper, we address another important aspect of ConvNet architecture design – its depth. To this end, we fix other parameters of the architecture, and steadily increase the depth of the network by adding more convolutional layers, which is feasible due to the use of very small (3 × 3) convolution filters in all layers.

随着ConvNets在计算机视觉领域中越来越普及,人们进行了许多尝试来改进Krizhevsky等人(2012)的原始架构,以达到更好的准确率。例如,ILSVRC-2013中表现最好的提交(Zeiler & Fergus,2013;Sermanet等,2014)在第一个卷积层使用了更小的感受窗口和更小的步长。另一类改进是在整幅图像上和多个尺度上密集地训练和测试网络(Sermanet等,2014;Howard,2014)。在本文中,我们讨论ConvNet架构设计的另一个重要方面:深度。为此,我们固定架构的其它参数,并通过添加更多的卷积层来稳步增加网络的深度;由于在所有层中都使用非常小的(3×3)卷积滤波器,这是可行的。

As a result, we come up with significantly more accurate ConvNet architectures, which not only achieve the state-of-the-art accuracy on ILSVRC classification and localisation tasks, but are also applicable to other image recognition datasets, where they achieve excellent performance even when used as a part of relatively simple pipelines (e.g. deep features classified by a linear SVM without fine-tuning). We have released our two best-performing models to facilitate further research.

结果,我们提出了精度明显更高的ConvNet架构,不仅在ILSVRC分类和定位任务上达到最优精度,而且同样适用于其它图像识别数据集:即使只作为相对简单流水线的一部分(例如,用线性SVM对深度特征进行分类而不做微调),它们也能取得出色的性能。我们已发布两个性能最佳的模型,以促进进一步的研究。

The rest of the paper is organised as follows. In Sect. 2, we describe our ConvNet configurations. The details of the image classification training and evaluation are then presented in Sect. 3, and the configurations are compared on the ILSVRC classification task in Sect. 4. Sect. 5 concludes the paper. For completeness, we also describe and assess our ILSVRC-2014 object localisation system in Appendix A, and discuss the generalisation of very deep features to other datasets in Appendix B. Finally, Appendix C contains the list of major paper revisions.

本文的其余部分安排如下。第2部分,描述我们的ConvNet配置。第3部分,介绍了图像分类训练和评估的详细信息。第4部分,在ILSVRC分类任务上比较了配置。第5部分得出本文结论。为了完整起见,我们还在附录A中描述和评估了我们的ILSVRC-2014目标定位系统,并在附录B中讨论了将很深的特征推广到其他数据集的情况。最后,附录C包含主要论文修订列表。


2 CONVNET CONFIGURATIONS

To measure the improvement brought by the increased ConvNet depth in a fair setting, all our ConvNet layer configurations are designed using the same principles, inspired by Ciresan et al. (2011); Krizhevsky et al. (2012). In this section, we first describe a generic layout of our ConvNet configurations (Sect. 2.1) and then detail the specific configurations used in the evaluation (Sect. 2.2). Our design choices are then discussed and compared to the prior art in Sect. 2.3.

为了在公平的环境下衡量不断增加的ConvNet深度带来的改进,我们的所有ConvNet层配置均采用相同的原理设计,并受到(Ciresan等人,2011);(Krizhevsky等,2012)的启发。在本节中,我们首先描述ConvNet配置的通用布局(第2.1节),并详细介绍评估中使用的特定配置(第2.2节)。然后讨论我们的设计选择,并将其与现有技术进行比较(第2.3节)。

2.1 ARCHITECTURE

During training, the input to our ConvNets is a fixed-size 224 × 224 RGB image. The only preprocessing we do is subtracting the mean RGB value, computed on the training set, from each pixel. The image is passed through a stack of convolutional (conv.) layers, where we use filters with a very small receptive field: 3 × 3 (which is the smallest size to capture the notion of left/right, up/down, center). In one of the configurations we also utilise 1 × 1 convolution filters, which can be seen as a linear transformation of the input channels (followed by non-linearity). The convolution stride is fixed to 1 pixel; the spatial padding of conv. layer input is such that the spatial resolution is preserved after convolution, i.e. the padding is 1 pixel for 3 × 3 conv. layers. Spatial pooling is carried out by five max-pooling layers, which follow some of the conv. layers (not all the conv. layers are followed by max-pooling). Max-pooling is performed over a 2 × 2 pixel window, with stride 2.

在训练过程中,ConvNets的输入是固定尺寸的224×224 RGB图像。我们唯一的预处理是从每个像素中减去在训练集上计算出的RGB均值。图像经过一系列堆叠的卷积层,我们在其中使用感受野很小的滤波器:3×3(这是能够捕获左/右、上/下、中心概念的最小尺寸)。在其中一种配置中,我们还使用了1×1卷积滤波器,它可以看作是对输入通道的线性变换(后接非线性)。卷积步幅固定为1个像素;卷积层输入的空间填充使得卷积后空间分辨率保持不变,即对于3×3卷积层,padding=1。空间池化由五个最大池化层执行,它们跟随在部分卷积层之后(并非所有卷积层之后都有最大池化)。最大池化在2×2像素窗口上执行,步幅为2。
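
The padding and stride arithmetic described above can be checked with a small helper (a generic sketch of the standard convolution output-size formula, not code from the paper):

```python
def conv2d_out(size, kernel, stride=1, pad=0):
    """Spatial output size of a square conv/pool: floor((size + 2*pad - kernel)/stride) + 1."""
    return (size + 2 * pad - kernel) // stride + 1

# A 3x3 conv. with stride 1 and 1-pixel padding preserves the 224x224 resolution:
print(conv2d_out(224, kernel=3, stride=1, pad=1))  # 224
# A 2x2 max-pooling with stride 2 halves it:
print(conv2d_out(224, kernel=2, stride=2, pad=0))  # 112
```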

A stack of convolutional layers (which has a different depth in different architectures) is followed by three Fully-Connected (FC) layers: the first two have 4096 channels each, the third performs 1000-way ILSVRC classification and thus contains 1000 channels (one for each class). The final layer is the soft-max layer. The configuration of the fully connected layers is the same in all networks.

卷积层的堆叠(在不同的体系结构中具有不同的深度)之后是三层全连接(FC)层:前两层各具有4096个通道,第三层进行1000类的 ILSVRC分类,因此包含1000个通道(每个通道一个类)。最后一层是soft-max层。在所有网络中,全连接层的配置都是相同的。

All hidden layers are equipped with the rectification (ReLU (Krizhevsky et al., 2012)) non-linearity. We note that none of our networks (except for one) contain Local Response Normalisation (LRN) normalisation (Krizhevsky et al., 2012): as will be shown in Sect. 4, such normalisation does not improve the performance on the ILSVRC dataset, but leads to increased memory consumption and computation time. Where applicable, the parameters for the LRN layer are those of (Krizhevsky et al., 2012).

所有隐藏层都配备了ReLU(Krizhevsky等,2012)非线性。我们注意到,我们的网络(除一个外)均不包含局部响应归一化(LRN)(Krizhevsky等人,2012):如第4节所示,这种归一化并不能提高在ILSVRC数据集上的性能,反而会增加内存消耗和计算时间。在使用LRN层的地方,其参数采用(Krizhevsky等,2012)中的参数。

2.2 CONFIGURATIONS

The ConvNet configurations, evaluated in this paper, are outlined in Table 1, one per column. In the following we will refer to the nets by their names (A–E). All configurations follow the generic design presented in Sect. 2.1, and differ only in the depth: from 11 weight layers in the network A (8 conv. and 3 FC layers) to 19 weight layers in the network E (16 conv. and 3 FC layers). The width of conv. layers (the number of channels) is rather small, starting from 64 in the first layer and then increasing by a factor of 2 after each max-pooling layer, until it reaches 512.

表1中概述了本文评估的ConvNet配置,每列一个。在下文中,我们将用名称(A–E)来指代这些网络。所有配置均遵循2.1节中介绍的通用设计,仅在深度上有所不同:从网络A的11个权重层(8个卷积层和3个FC层)到网络E的19个权重层(16个卷积层和3个FC层)。卷积层的宽度(通道数)相当小,从第一层的64个通道开始,每经过一个最大池化层翻倍,直到达到512。
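
The width schedule of the conv. stages (64, doubling after each of the first max-pooling layers, capped at 512) can be sketched as follows; the variable names are illustrative, not from the paper:

```python
# Channel widths of the five conv. stages: start at 64, double after
# each max-pooling layer, capped at 512.
widths = []
w = 64
for stage in range(5):
    widths.append(w)
    w = min(w * 2, 512)
print(widths)  # [64, 128, 256, 512, 512]
```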

In Table 2 we report the number of parameters for each configuration. In spite of a large depth, the number of weights in our nets is not greater than the number of weights in a more shallow net with larger conv. layer widths and receptive fields (144M weights in (Sermanet et al., 2014)).

在表2中,我们报告了每种配置的参数数量。尽管深度较大,我们网络中的权重数量并不多于具有更大卷积层宽度和感受野的较浅网络((Sermanet等,2014)中有1.44亿个权重)。

2.3 DISCUSSION

Our ConvNet configurations are quite different from the ones used in the top-performing entries of the ILSVRC-2012 (Krizhevsky et al., 2012) and ILSVRC-2013 competitions (Zeiler & Fergus, 2013; Sermanet et al., 2014). Rather than using relatively large receptive fields in the first conv. layers (e.g. 11×11 with stride 4 in (Krizhevsky et al., 2012), or 7×7 with stride 2 in (Zeiler & Fergus, 2013; Sermanet et al., 2014)), we use very small 3 × 3 receptive fields throughout the whole net, which are convolved with the input at every pixel (with stride 1). It is easy to see that a stack of two 3×3 conv. layers (without spatial pooling in between) has an effective receptive field of 5×5; three such layers have a 7 × 7 effective receptive field. So what have we gained by using, for instance, a stack of three 3×3 conv. layers instead of a single 7×7 layer? First, we incorporate three non-linear rectification layers instead of a single one, which makes the decision function more discriminative. Second, we decrease the number of parameters: assuming that both the input and the output of a three-layer 3 × 3 convolution stack has C channels, the stack is parametrised by $3(3^2C^2) = 27C^2$ weights; at the same time, a single 7 × 7 conv. layer would require $7^2C^2 = 49C^2$ parameters, i.e. 81% more. This can be seen as imposing a regularisation on the 7 × 7 conv. filters, forcing them to have a decomposition through the 3 × 3 filters (with non-linearity injected in between).

我们的ConvNet配置与ILSVRC-2012(Krizhevsky等人,2012)和ILSVRC-2013竞赛(Zeiler & Fergus,2013;Sermanet等人,2014)中表现最佳的网络所使用的配置差别很大。我们没有在第一层卷积中使用相对较大的感受野(例如,(Krizhevsky等人,2012)中步幅为4的11×11,或(Zeiler & Fergus,2013;Sermanet等人,2014)中步幅为2的7×7),而是在整个网络中使用非常小的3×3感受野,在每个像素处与输入进行卷积(步幅为1)。很容易看出,两个3×3卷积层的堆叠(中间没有空间池化)的有效感受野为5×5;三个这样的层具有7×7的有效感受野。那么,使用三个3×3卷积层的堆叠而不是单个7×7卷积层,我们得到了什么?首先,我们引入了三个非线性ReLU层而不是单个,这使得决策函数更具判别力。其次,我们减少了参数数量:假设三层3×3卷积堆叠的输入和输出都具有C个通道,则该堆叠共有 $3(3^2C^2)=27C^2$ 个权重;而单个7×7卷积层则需要 $7^2C^2=49C^2$ 个参数,即多出81%。这可以看作是对7×7卷积滤波器强加了一种正则化,迫使它们分解为3×3滤波器的组合(并在其间注入非线性)。
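
The parameter comparison above can be verified directly (a small arithmetic check, with biases ignored as in the text):

```python
def stacked_3x3_params(C, n_layers=3):
    # each 3x3 layer maps C -> C channels: 3*3*C*C weights per layer
    return n_layers * 9 * C * C

def single_7x7_params(C):
    # one 7x7 layer mapping C -> C channels
    return 49 * C * C

C = 512
small, large = stacked_3x3_params(C), single_7x7_params(C)
print(small == 27 * C * C, large == 49 * C * C)  # True True
print(round(large / small - 1, 2))  # 0.81, i.e. ~81% more weights for the 7x7 layer
```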

Table 1: ConvNet configurations (shown in columns). The depth of the configurations increases from the left (A) to the right (E), as more layers are added (the added layers are shown in bold). The convolutional layer parameters are denoted as “conv<receptive field size>-<number of channels>”. The ReLU activation function is not shown for brevity.

表1:ConvNet配置(按列显示)。随着更多层的添加(添加的层以粗体显示),配置的深度从左(A)到右(E)递增。卷积层参数表示为“conv<感受野大小>-<通道数>”。为简洁起见,未显示ReLU激活函数。
The incorporation of 1 × 1 conv. layers (configuration C, Table 1) is a way to increase the nonlinearity of the decision function without affecting the receptive fields of the conv. layers. Even though in our case the 1×1 convolution is essentially a linear projection onto the space of the same dimensionality (the number of input and output channels is the same), an additional non-linearity is introduced by the rectification function. It should be noted that 1×1 conv. layers have recently been utilised in the “Network in Network” architecture of Lin et al. (2014).

引入1×1卷积层(配置C,表1)是一种在不影响卷积层感受野的情况下增加决策函数非线性的方法。尽管在我们的情形中,1×1卷积本质上是到相同维数空间的线性投影(输入和输出通道数相同),但ReLU会引入额外的非线性。应该注意,(Lin等人,2014)最近在“Network in Network”架构中使用了1×1卷积层。

Small-size convolution filters have been previously used by Ciresan et al. (2011), but their nets are significantly less deep than ours, and they did not evaluate on the large-scale ILSVRC dataset. Goodfellow et al. (2014) applied deep ConvNets (11 weight layers) to the task of street number recognition, and showed that the increased depth led to better performance. GoogLeNet (Szegedy et al., 2014), a top-performing entry of the ILSVRC-2014 classification task, was developed independently of our work, but is similar in that it is based on very deep ConvNets (22 weight layers) and small convolution filters (apart from 3 × 3, they also use 1 × 1 and 5 × 5 convolutions). Their network topology is, however, more complex than ours, and the spatial resolution of the feature maps is reduced more aggressively in the first layers to decrease the amount of computation. As will be shown in Sect. 4.5, our model is outperforming that of Szegedy et al. (2014) in terms of the single-network classification accuracy.

(Ciresan等人,2011)先前已使用过小尺寸卷积滤波器,但他们的网络深度远小于我们的网络,并且没有在大规模ILSVRC数据集上进行评估。(Goodfellow等,2014)将深层ConvNets(11个权重层)应用于街景门牌号识别任务,并表明深度的增加带来了更好的性能。GoogLeNet(Szegedy等人,2014)是ILSVRC-2014分类任务中表现最好的提交之一,它独立于我们的工作而开发,但相似之处在于同样基于非常深的ConvNets(22个权重层)和小的卷积滤波器(除3×3外,它们还使用1×1和5×5卷积)。不过,它们的网络拓扑比我们的更复杂,并且在前几层中更激进地降低特征图的空间分辨率以减少计算量。如第4.5节所示,我们的模型在单网络分类准确率方面优于Szegedy等人(2014)的模型。


3 CLASSIFICATION FRAMEWORK

In the previous section we presented the details of our network configurations. In this section, we describe the details of classification ConvNet training and evaluation.

在上一节中,我们介绍了网络配置的详细信息。在本节中,我们将描述分类ConvNet训练和评估的详细信息。

3.1 TRAINING

The ConvNet training procedure generally follows Krizhevsky et al. (2012) (except for sampling the input crops from multi-scale training images, as explained later). Namely, the training is carried out by optimising the multinomial logistic regression objective using mini-batch gradient descent (based on back-propagation (LeCun et al., 1989)) with momentum. The batch size was set to 256, momentum to 0.9. The training was regularised by weight decay (the L2 penalty multiplier set to 5·10^−4) and dropout regularisation for the first two fully-connected layers (dropout ratio set to 0.5). The learning rate was initially set to 10^−2, and then decreased by a factor of 10 when the validation set accuracy stopped improving. In total, the learning rate was decreased 3 times, and the learning was stopped after 370K iterations (74 epochs). We conjecture that in spite of the larger number of parameters and the greater depth of our nets compared to (Krizhevsky et al., 2012), the nets required fewer epochs to converge due to (a) implicit regularisation imposed by greater depth and smaller conv. filter sizes; (b) pre-initialisation of certain layers.

ConvNet的训练过程大体遵循(Krizhevsky等人,2012)(除了从多尺度训练图像中采样输入裁剪,这将在后文说明)。即,通过使用带动量的小批量梯度下降(基于反向传播(LeCun等,1989))优化多项式逻辑回归目标来进行训练。批次大小设置为256,动量设置为0.9。训练通过权重衰减(L2罚因子设置为 $5·10^{-4}$)以及前两个全连接层的Dropout正则化(dropout比率设置为0.5)进行正则化。学习率初始设置为 $10^{-2}$,当验证集准确率停止提高时降低为原来的1/10。学习率总共降低了3次,训练在370K次迭代(74个epoch)后停止。我们推测,尽管与(Krizhevsky等人,2012)相比我们的网络参数更多、深度更大,但由于(a)更大的深度和更小的卷积滤波器尺寸带来的隐式正则化,以及(b)某些层的预初始化,网络收敛所需的epoch反而更少。
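
The learning-rate schedule described above (initial 10^−2, divided by 10 three times) can be summarised in a short sketch; the dict is only a summary of the stated hyper-parameters, not framework code:

```python
# Stated hyper-parameters: batch size 256, momentum 0.9, weight decay 5e-4, dropout 0.5.
config = {"batch_size": 256, "momentum": 0.9, "weight_decay": 5e-4, "dropout": 0.5}

lr, schedule = 1e-2, [1e-2]
for _ in range(3):      # the rate was decreased 3 times in total
    lr /= 10            # divide by 10 when validation accuracy plateaus
    schedule.append(lr)
print(schedule)         # ends near 1e-5 after 74 epochs of training
```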

The initialisation of the network weights is important, since bad initialisation can stall learning due to the instability of gradient in deep nets. To circumvent this problem, we began with training the configuration A (Table 1), shallow enough to be trained with random initialisation. Then, when training deeper architectures, we initialised the first four convolutional layers and the last three fully-connected layers with the layers of net A (the intermediate layers were initialised randomly). We did not decrease the learning rate for the pre-initialised layers, allowing them to change during learning. For random initialisation (where applicable), we sampled the weights from a normal distribution with the zero mean and 10^−2 variance. The biases were initialised with zero. It is worth noting that after the paper submission we found that it is possible to initialise the weights without pre-training by using the random initialisation procedure of Glorot & Bengio (2010).

网络权重的初始化很重要,因为糟糕的初始化会因深层网络中梯度的不稳定而使学习停滞。为了规避这个问题,我们从训练配置A(表1)开始,该配置足够浅,可以通过随机初始化进行训练。然后,在训练更深的架构时,我们用网络A的层初始化前四个卷积层和最后三个全连接层(中间层随机初始化)。我们没有降低预初始化层的学习率,允许它们在学习过程中改变。对于随机初始化(如适用),我们从均值为零、方差为 $10^{-2}$ 的正态分布中采样权重。偏置初始化为零。值得注意的是,在论文提交后,我们发现可以使用Glorot & Bengio(2010)的随机初始化程序,在不进行预训练的情况下初始化权重。

To obtain the fixed-size 224 × 224 ConvNet input images, they were randomly cropped from rescaled training images (one crop per image per SGD iteration). To further augment the training set, the crops underwent random horizontal flipping and random RGB colour shift (Krizhevsky et al., 2012). Training image rescaling is explained below.

为了获得固定尺寸的224×224ConvNet输入图像,它们是从重新缩放的训练图像中随机裁剪的(每个SGD迭代每个图像一个裁剪)。为了进一步增强训练集,对裁剪图像进行了随机的水平翻转和随机RGB色移(Krizhevsky等人,2012)。训练图像重缩放将在下面说明。

Training image size. Let S be the smallest side of an isotropically-rescaled training image, from which the ConvNet input is cropped (we also refer to S as the training scale). While the crop size is fixed to 224 × 224, in principle S can take on any value not less than 224: for S = 224 the crop will capture whole-image statistics, completely spanning the smallest side of a training image; for S ≫ 224 the crop will correspond to a small part of the image, containing a small object or an object part.

训练图像大小。令S为各向同性缩放后训练图像的最小边,ConvNet输入从中裁剪(我们也将S称为训练尺度)。虽然裁剪尺寸固定为224×224,但原则上S可以取不小于224的任何值:当S = 224时,裁剪将捕获整幅图像的统计信息,完全跨越训练图像的最小边;当S ≫ 224时,裁剪将对应于图像的一小部分,包含一个小物体或物体的一部分。

We consider two approaches for setting the training scale S. The first is to fix S, which corresponds to single-scale training (note that image content within the sampled crops can still represent multiscale image statistics). In our experiments, we evaluated models trained at two fixed scales: S = 256 (which has been widely used in the prior art (Krizhevsky et al., 2012; Zeiler & Fergus, 2013; Sermanet et al., 2014)) and S = 384. Given a ConvNet configuration, we first trained the network using S = 256. To speed-up training of the S = 384 network, it was initialised with the weights pre-trained with S = 256, and we used a smaller initial learning rate of 10^−3.

我们考虑两种设置训练尺度S的方法。第一种是固定S,对应于单尺度训练(注意,采样裁剪中的图像内容仍可以表示多尺度图像统计信息)。在实验中,我们评估了在两个固定尺度上训练的模型:S = 256(已在现有技术中广泛使用(Krizhevsky等,2012;Zeiler & Fergus,2013;Sermanet等,2014))和S = 384。给定一个ConvNet配置,我们首先使用S = 256训练网络。为了加速S = 384网络的训练,我们用S = 256预训练的权重对其进行初始化,并使用较小的初始学习率 $10^{-3}$。

The second approach to setting S is multi-scale training, where each training image is individually rescaled by randomly sampling S from a certain range [Smin, Smax] (we used Smin= 256 and Smax= 512). Since objects in images can be of different size, it is beneficial to take this into account during training. This can also be seen as training set augmentation by scale jittering, where a single model is trained to recognise objects over a wide range of scales. For speed reasons, we trained multi-scale models by fine-tuning all layers of a single-scale model with the same configuration, pre-trained with fixed S = 384.

设置S的第二种方法是多尺度训练,其中通过从某个范围[Smin,Smax](我们使用Smin = 256和Smax = 512)中随机采样S来分别缩放每个训练图像的大小。由于图像中的对象可以具有不同的大小,因此在训练过程中考虑这一点是有益的。这也可以看作是通过尺度抖动进行训练集增强,其中训练单个模型以识别大范围尺度上的对象。出于速度原因,我们通过对具有相同配置的单尺度模型的所有层进行微调来训练多尺度模型,并使用固定S = 384进行了预先训练。
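
The scale-jittering procedure can be sketched as below; `sample_training_crop` is an illustrative name, and the geometry is a simplification (isotropic rescale to a random S, then a random 224×224 crop):

```python
import random

def sample_training_crop(width, height, s_min=256, s_max=512, crop=224):
    """Rescale so the smaller side equals a random S in [s_min, s_max],
    then choose a random crop position; returns the rescaled size and crop box."""
    S = random.randint(s_min, s_max)
    scale = S / min(width, height)
    new_w, new_h = round(width * scale), round(height * scale)
    x = random.randint(0, new_w - crop)
    y = random.randint(0, new_h - crop)
    return (new_w, new_h), (x, y, x + crop, y + crop)

random.seed(0)
size, box = sample_training_crop(640, 480)
print(size, box)  # smaller rescaled side lies in [256, 512]; crop box is 224x224
```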

3.2 TESTING

At test time, given a trained ConvNet and an input image, it is classified in the following way. First, it is isotropically rescaled to a pre-defined smallest image side, denoted as Q (we also refer to it as the test scale). We note that Q is not necessarily equal to the training scale S (as we will show in Sect. 4, using several values of Q for each S leads to improved performance). Then, the network is applied densely over the rescaled test image in a way similar to (Sermanet et al., 2014). Namely, the fully-connected layers are first converted to convolutional layers (the first FC layer to a 7 × 7 conv. layer, the last two FC layers to 1 × 1 conv. layers). The resulting fully-convolutional net is then applied to the whole (uncropped) image. The result is a class score map with the number of channels equal to the number of classes, and a variable spatial resolution, dependent on the input image size. Finally, to obtain a fixed-size vector of class scores for the image, the class score map is spatially averaged (sum-pooled). We also augment the test set by horizontal flipping of the images; the soft-max class posteriors of the original and flipped images are averaged to obtain the final scores for the image.

在测试时,给定训练好的ConvNet和输入图像,按以下方式对其分类。首先,将图像各向同性地缩放到预定义的最小图像边,记为Q(我们也将其称为测试尺度)。我们注意到,Q不一定等于训练尺度S(如第4节所示,对每个S使用多个Q值可以提升性能)。然后,以类似于(Sermanet等,2014)的方式将网络密集地应用到缩放后的测试图像上。即,首先将全连接层转换为卷积层(第一个FC层转换为7×7卷积层,最后两个FC层转换为1×1卷积层)。然后将所得的全卷积网络应用于整幅(未裁剪)图像。结果是一个类别评分图,其通道数等于类别数,空间分辨率则取决于输入图像大小。最后,为了获得图像固定大小的类别评分向量,对类别评分图进行空间平均(求和池化)。我们还通过水平翻转图像来增强测试集;对原始图像和翻转图像的soft-max类别后验取平均,得到图像的最终评分。
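
The final sum-pooling step, which collapses the variable-resolution class score map into a fixed-size score vector, can be illustrated with plain lists (a toy sketch, not the paper's implementation):

```python
def average_score_map(score_map):
    """Spatially average a class score map (channels x H x W nested lists)
    into one score per class."""
    return [sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
            for channel in score_map]

# toy map: 2 classes over a 2x2 spatial grid
scores = average_score_map([[[1.0, 3.0], [5.0, 7.0]],
                            [[0.0, 2.0], [2.0, 4.0]]])
print(scores)  # [4.0, 2.0]
```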

3.3 IMPLEMENTATION DETAILS

Our implementation is derived from the publicly available C++ Caffe toolbox (Jia, 2013) (branched out in December 2013), but contains a number of significant modifications, allowing us to perform training and evaluation on multiple GPUs installed in a single system, as well as train and evaluate on full-size (uncropped) images at multiple scales (as described above). Multi-GPU training exploits data parallelism, and is carried out by splitting each batch of training images into several GPU batches, processed in parallel on each GPU. After the GPU batch gradients are computed, they are averaged to obtain the gradient of the full batch. Gradient computation is synchronous across the GPUs, so the result is exactly the same as when training on a single GPU.

我们的实现源自公开可用的C++ Caffe工具箱(Jia,2013)(于2013年12月从其分支而来),但包含许多重大修改,使我们能够在单个系统中安装的多个GPU上进行训练和评估,并能在多个尺度上对全尺寸(未裁剪)图像进行训练和评估(如上所述)。多GPU训练利用数据并行性,将每批训练图像拆分为若干GPU批次,在各GPU上并行处理。计算出各GPU批次的梯度后,对它们取平均以获得整个批次的梯度。梯度计算在GPU之间是同步的,因此结果与在单个GPU上训练时完全相同。
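
The synchronous gradient averaging can be sketched as follows (gradients represented as flat lists; an illustration of the scheme, not Caffe code):

```python
def average_gradients(per_gpu_grads):
    """Average per-GPU gradients; with synchronous computation this equals
    the full-batch gradient obtained on a single GPU."""
    n = len(per_gpu_grads)
    return [sum(g[i] for g in per_gpu_grads) / n
            for i in range(len(per_gpu_grads[0]))]

grads = average_gradients([[0.2, -0.4], [0.6, 0.0], [0.1, 0.1], [0.3, -0.5]])
print(grads)  # approximately [0.3, -0.2]
```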

While more sophisticated methods of speeding up ConvNet training have been recently proposed (Krizhevsky, 2014), which employ model and data parallelism for different layers of the net, we have found that our conceptually much simpler scheme already provides a speedup of 3.75 times on an off-the-shelf 4-GPU system, as compared to using a single GPU. On a system equipped with four NVIDIA Titan Black GPUs, training a single net took 2–3 weeks depending on the architecture.

虽然最近提出了更复杂的加速ConvNet训练的方法(Krizhevsky,2014),即对网络的不同层采用模型并行与数据并行,但我们发现,概念上简单得多的方案在现成的4-GPU系统上,相比使用单个GPU已可提供3.75倍的加速。在配备四个NVIDIA Titan Black GPU的系统上,训练单个网络需要2–3周,具体取决于架构。


4 CLASSIFICATION EXPERIMENTS

Dataset. In this section, we present the image classification results achieved by the described ConvNet architectures on the ILSVRC-2012 dataset (which was used for ILSVRC 2012–2014 challenges). The dataset includes images of 1000 classes, and is split into three sets: training (1.3M images), validation (50K images), and testing (100K images with held-out class labels). The classification performance is evaluated using two measures: the top-1 and top-5 error. The former is a multi-class classification error, i.e. the proportion of incorrectly classified images; the latter is the main evaluation criterion used in ILSVRC, and is computed as the proportion of images such that the ground-truth category is outside the top-5 predicted categories.

数据集。在本节中,我们介绍所描述的ConvNet架构在ILSVRC-2012数据集(用于ILSVRC 2012–2014挑战赛)上取得的图像分类结果。数据集包括1000个类别的图像,分为三组:训练集(130万张图像)、验证集(5万张图像)和测试集(10万张图像,类别标签不公开)。使用两个指标评估分类性能:top-1错误率和top-5错误率。前者是多类分类错误率,即被错误分类图像的比例;后者是ILSVRC使用的主要评估标准,计算为真实类别不在前5个预测类别之内的图像比例。
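
The two error measures can be computed as below; `topk_error` is an illustrative helper, with predictions given as class indices ranked by decreasing confidence:

```python
def topk_error(predictions, labels, k):
    """Fraction of images whose ground-truth class is outside the top-k predictions."""
    wrong = sum(1 for ranked, y in zip(predictions, labels) if y not in ranked[:k])
    return wrong / len(labels)

preds = [[3, 1, 2], [0, 2, 1], [2, 0, 1]]
labels = [3, 1, 0]
print(topk_error(preds, labels, k=1))  # top-1 error: 2 of 3 wrong
print(topk_error(preds, labels, k=3))  # 0.0: every label is within the top 3
```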

For the majority of experiments, we used the validation set as the test set. Certain experiments were also carried out on the test set and submitted to the official ILSVRC server as a “VGG” team entry to the ILSVRC-2014 competition (Russakovsky et al., 2014).

对于大多数实验,我们使用验证集作为测试集。测试集上还进行了某些实验,并作为ILSVRC-2014竞赛的“ VGG”参赛作品提交给了官方ILSVRC服务器(Russakovsky等人,2014)。

4.1 SINGLE SCALE EVALUATION

4.2 MULTI-SCALE EVALUATION

4.3 MULTI-CROP EVALUATION

4.4 CONVNET FUSION

Up until now, we evaluated the performance of individual ConvNet models. In this part of the experiments, we combine the outputs of several models by averaging their soft-max class posteriors. This improves the performance due to complementarity of the models, and was used in the top ILSVRC submissions in 2012 (Krizhevsky et al., 2012) and 2013 (Zeiler & Fergus, 2013; Sermanet et al., 2014).

到目前为止,我们评估了各个ConvNet模型的性能。在实验的这一部分中,我们通过平均其soft-max类后验来组合几个模型的输出。由于模型的互补性,这提高了性能,并在2012年(Krizhevsky等人,2012)和2013年(Zeiler&Fergus,2013; Sermanet等人,2014)的顶级ILSVRC提交中使用。

The results are shown in Table 6. By the time of ILSVRC submission we had only trained the single-scale networks, as well as a multi-scale model D (by fine-tuning only the fully-connected layers rather than all layers). The resulting ensemble of 7 networks has 7.3% ILSVRC test error. After the submission, we considered an ensemble of only two best-performing multi-scale models (configurations D and E), which reduced the test error to 7.0% using dense evaluation and 6.8% using combined dense and multi-crop evaluation. For reference, our best-performing single model achieves 7.1% error (model E, Table 5).

结果显示在表6中。到提交ILSVRC时,我们仅训练了单尺度网络以及一个多尺度模型D(仅微调全连接层而非所有层)。由7个网络组成的集成模型的ILSVRC测试错误率为7.3%。提交后,我们只用两个性能最佳的多尺度模型(配置D和E)进行集成,使用密集评估将测试错误率降低到7.0%,结合密集评估和多裁剪评估则降低到6.8%。作为参考,我们性能最好的单个模型错误率为7.1%(模型E,表5)。
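
The fusion step, averaging soft-max class posteriors across models, can be sketched with a toy two-model, three-class example (illustrative values only):

```python
def ensemble_posteriors(model_outputs):
    """Average the soft-max posteriors of several models class-by-class."""
    n = len(model_outputs)
    return [sum(m[c] for m in model_outputs) / n
            for c in range(len(model_outputs[0]))]

avg = ensemble_posteriors([[0.7, 0.2, 0.1],
                           [0.5, 0.4, 0.1]])
print(avg)  # approximately [0.6, 0.3, 0.1]; argmax gives the ensemble prediction
```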

4.5 COMPARISON WITH THE STATE OF THE ART


5 CONCLUSION

In this work we evaluated very deep convolutional networks (up to 19 weight layers) for large-scale image classification. It was demonstrated that the representation depth is beneficial for the classification accuracy, and that state-of-the-art performance on the ImageNet challenge dataset can be achieved using a conventional ConvNet architecture (LeCun et al., 1989; Krizhevsky et al., 2012) with substantially increased depth. In the appendix, we also show that our models generalise well to a wide range of tasks and datasets, matching or outperforming more complex recognition pipelines built around less deep image representations. Our results yet again confirm the importance of depth in visual representations.

在这项工作中,我们评估了非常深的卷积网络(最多19个权重层)在大规模图像分类中的表现。结果表明,表示深度有利于分类准确率,并且使用深度大幅增加的传统ConvNet架构(LeCun等,1989;Krizhevsky等,2012)就可以在ImageNet挑战数据集上达到最优性能。在附录中,我们还表明,我们的模型可以很好地推广到各种任务和数据集,匹敌或胜过围绕较浅图像表示构建的更复杂的识别流水线。我们的结果再次证实了深度在视觉表示中的重要性。
