论文原文:LINK
论文年份:2013
论文被引:9898(20/08/2020)
Visualizing and Understanding Convolutional Networks
Abstract
Large Convolutional Network models have recently demonstrated impressive classification performance on the ImageNet benchmark (Krizhevsky et al., 2012). However there is no clear understanding of why they perform so well, or how they might be improved. In this paper we address both issues. We introduce a novel visualization technique that gives insight into the function of intermediate feature layers and the operation of the classifier. Used in a diagnostic role, these visualizations allow us to find model architectures that outperform Krizhevsky et al. on the ImageNet classification benchmark. We also perform an ablation study to discover the performance contribution from different model layers. We show our ImageNet model generalizes well to other datasets: when the softmax classifier is retrained, it convincingly beats the current state-of-the-art results on Caltech-101 and Caltech-256 datasets.
大型卷积网络模型最近在ImageNet基准上展示了令人印象深刻的分类性能(Krizhevsky等,2012)。但是,对于它们为什么表现如此出色,或如何对其加以改进,目前尚无清晰的认识。在本文中,我们同时解决这两个问题。我们介绍了一种新颖的可视化技术,可深入了解中间特征层的功能以及分类器的操作。将这些可视化用于诊断时,我们得以找到在ImageNet分类基准上优于Krizhevsky等人的模型架构。我们还进行了消融研究,以发现不同模型层对性能的贡献。我们展示了ImageNet模型可以很好地推广到其他数据集:重新训练softmax分类器后,它令人信服地超过了Caltech-101和Caltech-256数据集上当前的最佳结果。
1. Introduction
Since their introduction by (LeCun et al., 1989) in the early 1990’s, Convolutional Networks (convnets) have demonstrated excellent performance at tasks such as hand-written digit classification and face detection. In the last year, several papers have shown that they can also deliver outstanding performance on more challenging visual classification tasks. (Ciresan et al., 2012) demonstrate state-of-the-art performance on NORB and CIFAR-10 datasets. Most notably, (Krizhevsky et al., 2012) show record beating performance on the ImageNet 2012 classification benchmark, with their convnet model achieving an error rate of 16.4%, compared to the 2nd place result of 26.1%. Several factors are responsible for this renewed interest in convnet models: (i) the availability of much larger training sets, with millions of labeled examples; (ii) powerful GPU implementations, making the training of very large models practical and (iii) better model regularization strategies, such as Dropout (Hinton et al., 2012).
自1990年代初期(LeCun等,1989)提出以来,卷积网络(convnets)在手写数字分类和面部检测等任务上表现出出色的性能。在过去一年中,几篇论文表明,它们还可以在更具挑战性的视觉分类任务中取得出色的性能。(Ciresan等人,2012)在NORB和CIFAR-10数据集上展示了当前最佳性能。最值得注意的是,(Krizhevsky等人,2012)在ImageNet 2012分类基准上取得了打破纪录的成绩,其convnet模型的错误率为16.4%,而第二名的结果为26.1%。对convnet模型重新产生兴趣有几个因素:(i)出现了大得多的训练集,包含数百万个带标注的样本;(ii)强大的GPU实现,使训练非常大的模型变得切实可行;(iii)更好的模型正则化策略,例如Dropout(Hinton等人,2012)。
Despite this encouraging progress, there is still little insight into the internal operation and behavior of these complex models, or how they achieve such good performance. From a scientific standpoint, this is deeply unsatisfactory. Without clear understanding of how and why they work, the development of better models is reduced to trial-and-error. In this paper we introduce a visualization technique that reveals the input stimuli that excite individual feature maps at any layer in the model. It also allows us to observe the evolution of features during training and to diagnose potential problems with the model. The visualization technique we propose uses a multi-layered Deconvolutional Network (deconvnet), as proposed by (Zeiler et al., 2011), to project the feature activations back to the input pixel space. We also perform a sensitivity analysis of the classifier output by occluding portions of the input image, revealing which parts of the scene are important for classification.
尽管取得了令人鼓舞的进步,但对于这些复杂模型的内部操作和行为,以及它们如何获得如此好的性能,仍然知之甚少。从科学的角度来看,这是远远不能令人满意的。如果不清楚地了解它们如何以及为何起作用,开发更好的模型就只能沦为反复试验。在本文中,我们介绍了一种可视化技术,该技术揭示了在模型任意层上激发单个特征图的输入刺激。它还使我们能够观察训练过程中特征的演变,并诊断模型的潜在问题。我们提出的可视化技术使用由(Zeiler et al., 2011)提出的多层反卷积网络(deconvnet),将特征激活投影回输入像素空间。我们还通过遮挡输入图像的不同部分来对分类器输出进行敏感性分析,揭示场景的哪些部分对于分类很重要。
Using these tools, we start with the architecture of (Krizhevsky et al., 2012) and explore different architectures, discovering ones that outperform their results on ImageNet. We then explore the generalization ability of the model to other datasets, just retraining the softmax classifier on top. As such, this is a form of supervised pre-training, which contrasts with the unsupervised pre-training methods popularized by (Hinton et al., 2006) and others (Bengio et al., 2007; Vincent et al., 2008). The generalization ability of convnet features is also explored in concurrent work by (Donahue et al., 2013).
使用这些工具,我们从(Krizhevsky et al., 2012)的架构出发,探索不同的架构,发现了在ImageNet上性能更优的架构。然后,我们仅重新训练顶部的softmax分类器,以探索模型对其他数据集的泛化能力。因此,这是一种有监督的预训练形式,与(Hinton等,2006)和其他人(Bengio等,2007;Vincent等,2008)普及的无监督预训练方法形成鲜明对比。(Donahue et al., 2013)在同期工作中也探索了convnet特征的泛化能力。
1.1. Related Work
Visualizing features to gain intuition about the network is common practice, but mostly limited to the 1st layer where projections to pixel space are possible. In higher layers this is not the case, and there are limited methods for interpreting activity. (Erhan et al., 2009) find the optimal stimulus for each unit by performing gradient descent in image space to maximize the unit’s activation. This requires a careful initialization and does not give any information about the unit’s invariances. Motivated by the latter’s short-coming, (Le et al., 2010) (extending an idea by (Berkes & Wiskott, 2006)) show how the Hessian of a given unit may be computed numerically around the optimal response, giving some insight into invariances. The problem is that for higher layers, the invariances are extremely complex so are poorly captured by a simple quadratic approximation. Our approach, by contrast, provides a non-parametric view of invariance, showing which patterns from the training set activate the feature map. (Donahue et al., 2013) show visualizations that identify patches within a dataset that are responsible for strong activations at higher layers in the model. Our visualizations differ in that they are not just crops of input images, but rather top-down projections that reveal structures within each patch that stimulate a particular feature map.
可视化特征以获得关于网络的直觉是常见的做法,但主要限于可以投影到像素空间的第一层。在更高的层中并非如此,解释其活动的方法也很有限。(Erhan et al., 2009)通过在图像空间中执行梯度下降以最大化单元的激活,从而为每个单元找到最佳刺激。这需要仔细的初始化,并且不提供有关单元不变性的任何信息。受后一缺点的启发,(Le et al., 2010)(扩展了(Berkes & Wiskott, 2006)的想法)展示了如何围绕最佳响应数值计算给定单元的Hessian,从而对不变性有所了解。问题在于,对于较高的层,不变性非常复杂,简单的二次逼近难以捕获。相比之下,我们的方法提供了不变性的非参数视图,显示了训练集中的哪些模式激活了特征图。(Donahue et al., 2013)展示的可视化可识别数据集中对模型较高层的强激活负责的补丁。我们的可视化的不同之处在于,它们不仅是输入图像的裁剪,而且是自顶向下的投影,揭示了每个补丁中刺激特定特征图的结构。
2. Approach
We use standard fully supervised convnet models throughout the paper, as defined by (LeCun et al., 1989) and (Krizhevsky et al., 2012). These models map a color 2D input image xᵢ, via a series of layers, to a probability vector ŷᵢ over the C different classes. Each layer consists of (i) convolution of the previous layer output (or, in the case of the 1st layer, the input image) with a set of learned filters; (ii) passing the responses through a rectified linear function (relu(x) = max(x,0)); (iii) [optionally] max pooling over local neighborhoods and (iv) [optionally] a local contrast operation that normalizes the responses across feature maps. For more details of these operations, see (Krizhevsky et al., 2012) and (Jarrett et al., 2009). The top few layers of the network are conventional fully-connected networks and the final layer is a softmax classifier. Fig. 3 shows the model used in many of our experiments.
根据(LeCun等,1989)和(Krizhevsky等,2012)的定义,我们在整篇论文中使用标准的全监督convnet模型。这些模型通过一系列层,将彩色2D输入图像xᵢ映射到C个不同类别上的概率向量ŷᵢ。每一层包括:(i)将上一层输出(或在第一层的情况下为输入图像)与一组学习得到的滤波器进行卷积;(ii)将响应通过修正线性函数(relu(x) = max(x,0));(iii)[可选]在局部邻域上的最大池化;以及(iv)[可选]跨特征图对响应进行归一化的局部对比度操作。有关这些操作的更多详细信息,请参见(Krizhevsky等,2012)和(Jarrett等,2009)。网络的最上面几层是常规的全连接网络,最后一层是softmax分类器。图3显示了我们许多实验中使用的模型。
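结合上文描述的层结构(卷积 → relu → 最大池化),下面用一段最小的 NumPy 草图示意单个卷积层的前向计算。这并非论文的实现,省略了多通道、步幅和对比度归一化等细节,仅作示意:

```python
import numpy as np

def conv2d_valid(x, k):
    """单通道 2D 互相关('valid' 模式),即 convnet 中通常所说的“卷积”。"""
    kh, kw = k.shape
    H, W = x.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i+kh, j:j+kw] * k)
    return out

def relu(x):
    """修正线性函数 relu(x) = max(x, 0)。"""
    return np.maximum(x, 0.0)

def max_pool(x, s=2):
    """s×s 不重叠最大池化(假设尺寸可被 s 整除)。"""
    H, W = x.shape
    return x.reshape(H // s, s, W // s, s).max(axis=(1, 3))

# 一个“层” = 卷积 -> relu -> 最大池化
x = np.random.randn(8, 8)       # 玩具输入“图像”
k = np.random.randn(3, 3)       # 一个学习得到的滤波器(此处随机代替)
feat = max_pool(relu(conv2d_valid(x, k)), s=2)   # 输出特征图,形状 (3, 3)
```

其中 `relu` 之后的响应必为非负,这正是后文 deconvnet 重建时同样施加 relu 的原因。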
We train these models using a large set of N labeled images {x, y}, where label yᵢ is a discrete variable indicating the true class. A cross-entropy loss function, suitable for image classification, is used to compare ŷᵢ and yᵢ. The parameters of the network (filters in the convolutional layers, weight matrices in the fully-connected layers and biases) are trained by backpropagating the derivative of the loss with respect to the parameters throughout the network, and updating the parameters via stochastic gradient descent. Full details of training are given in Section 3.
我们使用大量的N个标记图像{x, y}训练这些模型,其中标签yᵢ是表示真实类别的离散变量。适用于图像分类的交叉熵损失函数用于比较ŷᵢ和yᵢ。网络的参数(卷积层中的滤波器、全连接层中的权重矩阵和偏置)通过在整个网络中反向传播损失对参数的导数,并经由随机梯度下降更新参数来训练。训练的完整细节在第3节中给出。
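上文提到的交叉熵损失及其反向传播可以用下面的小例子核验:softmax 交叉熵对 logits 的梯度有闭式解 softmax(z) − onehot(y),可与有限差分结果对照(示意代码,非论文实现):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())          # 减去最大值保证数值稳定
    return e / e.sum()

def cross_entropy(z, y):
    """z: 网络输出的 logits;y: 真实类别下标。"""
    return -np.log(softmax(z)[y])

z = np.array([2.0, 0.5, -1.0])
y = 0
grad = softmax(z) - np.eye(len(z))[y]        # dL/dz 的闭式解

# 用中心差分数值核对梯度
eps = 1e-6
num = np.array([(cross_entropy(z + eps * np.eye(3)[k], y)
                 - cross_entropy(z - eps * np.eye(3)[k], y)) / (2 * eps)
                for k in range(3)])
```

这一梯度沿网络反向传播,再由 SGD 更新各层参数(见第3节)。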
2.1. Visualization with a Deconvnet
Understanding the operation of a convnet requires interpreting the feature activity in intermediate layers. We present a novel way to map these activities back to the input pixel space, showing what input pattern originally caused a given activation in the feature maps. We perform this mapping with a Deconvolutional Network (deconvnet) (Zeiler et al., 2011). A deconvnet can be thought of as a convnet model that uses the same components (filtering, pooling) but in reverse, so instead of mapping pixels to features does the opposite. In (Zeiler et al., 2011), deconvnets were proposed as a way of performing unsupervised learning. Here, they are not used in any learning capacity, just as a probe of an already trained convnet.
了解convnet的操作需要解释中间层中的特征活动。我们提出了一种新颖的方法,将这些活动映射回输入像素空间,以显示最初是哪种输入模式导致了特征图中的给定激活。我们使用反卷积网络(deconvnet)(Zeiler等,2011)执行此映射。deconvnet可以被认为是使用相同组件(滤波、池化)但反向运行的convnet模型:它不是将像素映射到特征,而是将特征映射回像素。在(Zeiler et al., 2011)中,deconvnet被提出作为一种执行无监督学习的方法。在这里,它们不参与任何学习,仅被用作探究已训练好的convnet的工具。
To examine a convnet, a deconvnet is attached to each of its layers, as illustrated in Fig. 1(top), providing a continuous path back to image pixels. To start, an input image is presented to the convnet and features computed throughout the layers. To examine a given convnet activation, we set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layer. Then we successively (i) unpool, (ii) rectify and (iii) filter to reconstruct the activity in the layer beneath that gave rise to the chosen activation. This is then repeated until input pixel space is reached.
为了检查convnet,将一个deconvnet附加到它的每一层,如图1(顶部)所示,从而提供一条返回图像像素的连续路径。首先,将输入图像送入convnet,并在各层中计算特征。为了检查给定的convnet激活,我们将该层中的所有其他激活设置为零,并将特征图作为输入传递给附加的deconvnet层。然后,我们依次进行(i)unpool、(ii)rectify、(iii)filter,以重建下一层中引起所选激活的活动。重复此过程,直到到达输入像素空间为止。
Unpooling: In the convnet, the max pooling operation is non-invertible, however we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variables. In the deconvnet, the unpooling operation uses these switches to place the reconstructions from the layer above into appropriate locations, preserving the structure of the stimulus. See Fig. 1(bottom) for an illustration of the procedure.
Unpooling:在卷积网络中,最大池化操作是不可逆的,但是我们可以通过在一组开关变量中记录每个池化区域内最大值的位置来获得近似逆。在去卷积网络中,解池操作使用这些开关将来自上一层的重建内容放置到适当的位置,从而保留刺激的结构。有关该过程的说明,请参见图1(底部)。
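上述带开关变量(switches)的最大池化及其近似逆(解池)可以用如下 NumPy 草图示意。假设池化区域为不重叠的 s×s 块且尺寸整除(非论文实现,仅为示意):

```python
import numpy as np

def max_pool_with_switches(x, s=2):
    """最大池化,同时在 switches 中记录每个池化区域内最大值的位置。"""
    H, W = x.shape
    pooled = np.zeros((H // s, W // s))
    switches = np.zeros((H // s, W // s, 2), dtype=int)
    for i in range(H // s):
        for j in range(W // s):
            patch = x[i*s:(i+1)*s, j*s:(j+1)*s]
            r, c = np.unravel_index(np.argmax(patch), patch.shape)
            pooled[i, j] = patch[r, c]
            switches[i, j] = (i*s + r, j*s + c)
    return pooled, switches

def unpool(pooled, switches, shape):
    """解池:借助开关把每个值放回原最大值的位置,其余位置为 0。"""
    out = np.zeros(shape)
    for i in range(pooled.shape[0]):
        for j in range(pooled.shape[1]):
            r, c = switches[i, j]
            out[r, c] = pooled[i, j]
    return out

x = np.array([[1., 3., 0., 2.],
              [4., 2., 5., 1.],
              [0., 6., 1., 0.],
              [2., 1., 3., 7.]])
p, sw = max_pool_with_switches(x, 2)
r = unpool(p, sw, x.shape)   # 仅在各区域最大值位置非零,保留了刺激的空间结构
```

可以看到,解池得到的 `r` 只是原输入的稀疏近似,但最大值的空间位置被精确保留,这正是“保留刺激结构”的含义。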
Rectification: The convnet uses relu non-linearities, which rectify the feature maps thus ensuring the feature maps are always positive. To obtain valid feature reconstructions at each layer (which also should be positive), we pass the reconstructed signal through a relu non-linearity.
Rectification:卷积网络使用relu非线性,可校正特征图,从而确保特征图始终为正。为了在每一层上获得有效的特征重建(也应该是正数),我们将重建的信号通过relu非线性传递。
Filtering: The convnet uses learned filters to convolve the feature maps from the previous layer. To invert this, the deconvnet uses transposed versions of the same filters, but applied to the rectified maps, not the output of the layer beneath. In practice this means flipping each filter vertically and horizontally.
Filtering:convnet使用学习得到的滤波器对来自上一层的特征图进行卷积。为了对其求逆,deconvnet使用相同滤波器的转置版本,但应用于经过修正(rectified)的特征图,而不是其下层的输出。实际上,这意味着将每个滤波器垂直和水平翻转。
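“对翻转后的滤波器做卷积”之所以是合理的逆向操作,是因为它恰好是前向卷积算子的转置(伴随):对任意 x、y 有 ⟨conv(x,k), y⟩ = ⟨x, deconv(y,k)⟩。下面的草图验证这一恒等式(单通道、valid/full 模式的示意实现,非论文代码):

```python
import numpy as np

def correlate_valid(x, k):
    """convnet 前向:'valid' 互相关。"""
    kh, kw = k.shape
    H, W = x.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i+kh, j:j+kw] * k)
    return out

def filter_transpose(y, k):
    """deconvnet 的 filtering:用垂直+水平翻转后的滤波器做 'full' 互相关,
    等价于前向卷积算子的转置。"""
    kh, kw = k.shape
    yp = np.pad(y, ((kh - 1, kh - 1), (kw - 1, kw - 1)))
    return correlate_valid(yp, k[::-1, ::-1])

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 4))
k = rng.standard_normal((3, 3))
y = rng.standard_normal((2, 2))
lhs = np.sum(correlate_valid(x, k) * y)    # <A x, y>
rhs = np.sum(x * filter_transpose(y, k))   # <x, A^T y>
```

两个内积相等,说明翻转滤波器的 full 互相关把特征图空间的信号正确地“搬回”了下一层(最终是像素)空间。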
Projecting down from higher layers uses the switch settings generated by the max pooling in the convnet on the way up. As these switch settings are peculiar to a given input image, the reconstruction obtained from a single activation thus resembles a small piece of the original input image, with structures weighted according to their contribution toward to the feature activation. Since the model is trained discriminatively, they implicitly show which parts of the input image are discriminative. Note that these projections are not samples from the model, since there is no generative process involved.
从较高层向下投影时,使用的是convnet前向计算中最大池化所生成的开关设置。由于这些开关设置特定于给定的输入图像,因此由单个激活获得的重建类似于原始输入图像的一小块,其中的结构按照它们对该特征激活的贡献加权。由于模型是判别式训练的,这些重建隐式地显示了输入图像的哪些部分具有判别性。请注意,这些投影并非来自模型的采样,因为其中不涉及任何生成过程。
Figure 1. Top: A deconvnet layer (left) attached to a convnet layer (right). The deconvnet will reconstruct an approximate version of the convnet features from the layer beneath. Bottom: An illustration of the unpooling operation in the deconvnet, using switches which record the location of the local max in each pooling region (colored zones) during pooling in the convnet.
图1.顶部:连接到convnet层(右)的deconvnet层(左)。 deconvnet将从下面的层重建convnet特征的近似版本。下图:在deconvnet中unpooling操作的示意图,其中使用了开关,这些开关记录了在convnet中进行池化时每个池化区域(彩色区域)中局部最大值的位置。
3. Training Details
We now describe the large convnet model that will be visualized in Section 4. The architecture, shown in Fig. 3, is similar to that used by (Krizhevsky et al., 2012) for ImageNet classification. One difference is that the sparse connections used in Krizhevsky’s layers 3,4,5 (due to the model being split across 2 GPUs) are replaced with dense connections in our model.
现在,我们描述将在第4节中可视化的大型convnet模型。图3所示的体系结构类似于(Krizhevsky et al., 2012)用于ImageNet分类的体系结构。区别之一是,Krizhevsky的第3、4、5层中使用的稀疏连接(源于模型被拆分到2个GPU上)在我们的模型中被密集连接所取代。
Other important differences relating to layers 1 and 2 were made following inspection of the visualizations in Fig. 6, as described in Section 4.1.
如第4.1节所述,在检查了图6中的可视化之后,我们对第1层和第2层做出了其他重要更改。
The model was trained on the ImageNet 2012 training set (1.3 million images, spread over 1000 different classes). Each RGB image was preprocessed by resizing the smallest dimension to 256, cropping the center 256x256 region, subtracting the per-pixel mean (across all images) and then using 10 different sub-crops of size 224x224 (corners + center with(out) horizontal flips). Stochastic gradient descent with a mini-batch size of 128 was used to update the parameters, starting with a learning rate of 10⁻², in conjunction with a momentum term of 0.9. We anneal the learning rate throughout training manually when the validation error plateaus. Dropout (Hinton et al., 2012) is used in the fully connected layers (6 and 7) with a rate of 0.5. All weights are initialized to 10⁻² and biases are set to 0.
该模型在ImageNet 2012训练集(130万张图像,涵盖1000个类别)上进行训练。每张RGB图像的预处理为:将最短边缩放到256,裁剪中心的256x256区域,减去(对所有图像计算的)每像素均值,然后使用10个不同的224x224子裁剪(四角+中心,以及它们的水平翻转)。使用mini-batch大小为128的随机梯度下降更新参数,学习率从10⁻²开始,动量项设为0.9。当验证误差进入平台期时,我们在训练过程中手动对学习率进行退火。全连接层(第6和第7层)中使用比率为0.5的Dropout(Hinton等人,2012)。所有权重初始化为10⁻²,偏置设置为0。
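上文的优化设置(mini-batch SGD + 动量0.9)对应的参数更新规则可以示意如下。这里用一个假设的一维玩具目标 f(w)=w² 演示收敛行为,并非论文的训练代码:

```python
def sgd_momentum_step(w, v, grad, lr=1e-2, momentum=0.9):
    """带动量的 SGD 单步更新: v <- momentum*v - lr*grad;  w <- w + v。"""
    v = momentum * v - lr * grad
    return w + v, v

# 在 f(w) = w^2 上迭代(梯度为 2w),观察 w 收敛到 0
w, v = 5.0, 0.0
for _ in range(500):
    w, v = sgd_momentum_step(w, v, grad=2 * w)
```

实际训练中梯度来自 mini-batch 上交叉熵损失的反向传播,且学习率会在验证误差进入平台期时手动衰减。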
Visualization of the first layer filters during training reveals that a few of them dominate, as shown in Fig. 6(a). To combat this, we renormalize each filter in the convolutional layers whose RMS value exceeds a fixed radius of 10⁻¹ to this fixed radius. This is crucial, especially in the first layer of the model, where the input images are roughly in the [-128,128] range. As in (Krizhevsky et al., 2012), we produce multiple different crops and flips of each training example to boost training set size. We stopped training after 70 epochs, which took around 12 days on a single GTX580 GPU, using an implementation based on (Krizhevsky et al., 2012).
训练过程中第一层滤波器的可视化显示,其中少数几个占据主导地位,如图6(a)所示。为了解决这个问题,我们将卷积层中RMS值超过固定半径10⁻¹的每个滤波器重新归一化到该半径。这一点至关重要,尤其是在模型的第一层,因为输入图像大致处于[-128,128]范围内。与(Krizhevsky et al., 2012)一样,我们为每个训练样本生成多种裁剪和翻转,以扩大训练集规模。我们使用基于(Krizhevsky et al., 2012)的实现,在70个epoch后停止训练,这在单个GTX580 GPU上花费了大约12天。
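滤波器 RMS 重归一化这一步可以示意如下(NumPy 草图,非论文实现;`radius` 即上文的固定半径 10⁻¹):

```python
import numpy as np

def renormalize_filters(filters, radius=1e-1):
    """对 RMS 值超过 radius 的滤波器整体缩放回该半径;其余保持不变。
    filters: 形状 (num_filters, kh, kw)。"""
    out = filters.astype(float).copy()
    for i, f in enumerate(out):
        rms = np.sqrt(np.mean(f ** 2))
        if rms > radius:
            out[i] = f * (radius / rms)
    return out

filters = np.stack([np.full((7, 7), 3.0),     # RMS = 3,会被缩放到 0.1
                    np.full((7, 7), 0.05)])   # RMS = 0.05,保持不变
normed = renormalize_filters(filters)
```

这样可以防止少数滤波器在幅值上“吞掉”其余滤波器,即图6(a)中观察到的少数特征占主导的现象。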
4. Convnet Visualization
Using the model described in Section 3, we now use the deconvnet to visualize the feature activations on the ImageNet validation set.
使用第3节中描述的模型,我们现在使用deconvnet可视化ImageNet验证集上的特征激活。
Feature Visualization: Fig. 2 shows feature visualizations from our model once training is complete. However, instead of showing the single strongest activation for a given feature map, we show the top 9 activations. Projecting each separately down to pixel space reveals the different structures that excite a given feature map, hence showing its invariance to input deformations. Alongside these visualizations we show the corresponding image patches. These have greater variation than visualizations as the latter solely focus on the discriminant structure within each patch. For example, in layer 5, row 1, col 2, the patches appear to have little in common, but the visualizations reveal that this particular feature map focuses on the grass in the background, not the foreground objects.
特征可视化:图2显示了训练完成后来自我们模型的特征可视化。不过,我们没有只显示给定特征图的单个最强激活,而是显示了前9个激活。将每个激活分别投影到像素空间,可以揭示激发给定特征图的不同结构,从而显示出它对输入变形的不变性。在这些可视化旁边,我们还显示了相应的图像补丁。它们比可视化具有更大的变化,因为后者仅关注每个补丁内的判别性结构。例如,在第5层第1行第2列中,这些补丁看上去几乎没有共同点,但可视化显示该特征图关注的是背景中的草地,而不是前景物体。
Figure 2. Visualization of features in a fully trained model. For layers 2-5 we show the top 9 activations in a random subset of feature maps across the validation data, projected down to pixel space using our deconvolutional network approach. Our reconstructions are not samples from the model: they are reconstructed patterns from the validation set that cause high activations in a given feature map. For each feature map we also show the corresponding image patches. Note: (i) the strong grouping within each feature map, (ii) greater invariance at higher layers and (iii) exaggeration of discriminative parts of the image, e.g. eyes and noses of dogs (layer 4, row 1, col 1). Best viewed in electronic form.
图2. 训练完成的模型中的特征可视化。对于第2-5层,我们在验证数据上显示了特征图随机子集中的前9个激活,并使用我们的反卷积网络方法投影到像素空间。我们的重建并非来自模型的采样:它们是来自验证集、在给定特征图中引起高激活的重建模式。对于每个特征图,我们还显示了相应的图像补丁。注意:(i)每个特征图内的强分组,(ii)较高层具有更大的不变性,以及(iii)对图像判别性部分的夸大,例如狗的眼睛和鼻子(第4层,第1行,第1列)。最好以电子形式查看。
The projections from each layer show the hierarchical nature of the features in the network. Layer 2 responds to corners and other edge/color conjunctions. Layer 3 has more complex invariances, capturing similar textures (e.g. mesh patterns (Row 1, Col 1); text (R2,C4)). Layer 4 shows significant variation, but is more class-specific: dog faces (R1,C1); bird’s legs (R4,C2). Layer 5 shows entire objects with significant pose variation, e.g. keyboards (R1,C11) and dogs (R4).
每层的投影显示了网络中特征的层次性质。第2层响应角点和其他边缘/颜色组合。第3层具有更复杂的不变性,捕获相似的纹理(例如网格图案(第1行,第1列);文本(R2,C4))。第4层显示出显著的变化,但更具类别特异性:狗脸(R1,C1);鸟腿(R4,C2)。第5层显示姿态变化很大的完整物体,例如键盘(R1,C11)和狗(R4)。
Feature Evolution during Training: Fig. 4 visualizes the progression during training of the strongest activation (across all training examples) within a given feature map projected back to pixel space. Sudden jumps in appearance result from a change in the image from which the strongest activation originates. The lower layers of the model can be seen to converge within a few epochs. However, the upper layers only develop after a considerable number of epochs (40-50), demonstrating the need to let the models train until fully converged.
训练期间的特征演化:图4可视化了给定特征图中最强激活(在所有训练样本上)投影回像素空间后,在训练过程中的演变。外观的突然跳变源自最强激活所来自的图像发生了变化。可以看到,模型的较低层在几个epoch内就收敛了。但是,较高层要在相当多的epoch(40-50)之后才发展成形,这表明需要让模型一直训练到完全收敛。
Feature Invariance: Fig. 5 shows 5 sample images being translated, rotated and scaled by varying degrees while looking at the changes in the feature vectors from the top and bottom layers of the model, relative to the untransformed feature. Small transformations have a dramatic effect in the first layer of the model, but a lesser impact at the top feature layer, being quasilinear for translation & scaling. The network output is stable to translations and scalings. In general, the output is not invariant to rotation, except for object with rotational symmetry (e.g. entertainment center).
特征不变性:图5展示了5个样本图像在不同程度的平移、旋转和缩放下,模型顶层和底层的特征向量相对于未变换特征的变化。小的变换对模型第一层有显著影响,但对顶层特征层影响较小,对平移和缩放近似线性。网络输出对平移和缩放是稳定的。一般而言,除具有旋转对称性的物体(例如组合电视柜,entertainment center)外,输出对旋转不是不变的。
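图5中“特征向量相对未变换图像的变化”这一度量可以示意如下。这里用“列均值”这一玩具特征代替真实的卷积特征(假设性示例,非论文实现):它对垂直循环平移严格不变,特征距离为0,而原始像素层面的距离通常不为0——这正是“顶层特征比底层像素更稳定”所度量的现象:

```python
import numpy as np

def feature_distance(feat_fn, img, transform):
    """变换前后特征向量之间的欧氏距离。feat_fn 为假设的特征提取函数。"""
    return np.linalg.norm(feat_fn(transform(img)) - feat_fn(img))

rng = np.random.default_rng(1)
img = rng.standard_normal((16, 16))
shift = lambda im: np.roll(im, 2, axis=0)   # 垂直(循环)平移 2 个像素

d_pixels  = feature_distance(lambda im: im.ravel(), img, shift)       # 像素层面:大
d_colmean = feature_distance(lambda im: im.mean(axis=0), img, shift)  # 平移不变特征:≈0
```

当然,真实convnet的顶层特征只是近似不变(图5中为准线性),不像这个玩具特征那样严格不变。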
4.1. Architecture Selection
While visualization of a trained model gives insight into its operation, it can also assist with selecting good architectures in the first place. By visualizing the first and second layers of Krizhevsky et al.'s architecture (Fig. 6(b) & (d)), various problems are apparent. The first layer filters are a mix of extremely high and low frequency information, with little coverage of the mid frequencies. Additionally, the 2nd layer visualization shows aliasing artifacts caused by the large stride 4 used in the 1st layer convolutions. To remedy these problems, we (i) reduced the 1st layer filter size from 11x11 to 7x7 and (ii) made the stride of the convolution 2, rather than 4. This new architecture retains much more information in the 1st and 2nd layer features, as shown in Fig. 6(c) & (e). More importantly, it also improves the classification performance as shown in Section 5.1.
虽然对训练好的模型进行可视化可以深入了解其操作,但它也可以在一开始就帮助选择好的架构。通过可视化Krizhevsky等人架构的第一层和第二层(图6(b)和(d)),可以发现各种问题。第一层滤波器混合了极高和极低的频率信息,很少覆盖中频。此外,第二层可视化显示出由第一层卷积中使用的大步幅4引起的混叠伪影(aliasing artifacts)。为了解决这些问题,我们(i)将第一层滤波器的尺寸从11x11减小到7x7,并且(ii)将卷积步幅从4减小到2。这种新架构在第一层和第二层特征中保留了更多信息,如图6(c)和(e)所示。更重要的是,它还提高了分类性能,如5.1节所示。
4.2. Occlusion Sensitivity
With image classification approaches, a natural question is if the model is truly identifying the location of the object in the image, or just using the surrounding context. Fig. 7 attempts to answer this question by systematically occluding different portions of the input image with a grey square, and monitoring the output of the classifier. The examples clearly show the model is localizing the objects within the scene, as the probability of the correct class drops significantly when the object is occluded. Fig. 7 also shows visualizations from the strongest feature map of the top convolution layer, in addition to activity in this map (summed over spatial locations) as a function of occluder position. When the occluder covers the image region that appears in the visualization, we see a strong drop in activity in the feature map. This shows that the visualization genuinely corresponds to the image structure that stimulates that feature map, hence validating the other visualizations shown in Fig. 4 and Fig. 2.
对于图像分类方法,一个自然的问题是:模型是真正识别了物体在图像中的位置,还是仅利用了周围的上下文。图7试图通过用灰色方块系统地遮挡输入图像的不同部分并监视分类器的输出来回答这个问题。这些示例清楚地表明,模型正在场景中定位物体:当物体被遮挡时,正确类别的概率会显著下降。图7还显示了顶部卷积层最强特征图的可视化,以及该特征图中的活动(对空间位置求和)随遮挡物位置变化的情况。当遮挡物盖住可视化中出现的图像区域时,我们会看到该特征图中的活动急剧下降。这表明该可视化确实对应于刺激该特征图的图像结构,从而验证了图4和图2所示的其他可视化。
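遮挡敏感性实验的核心循环可以示意如下:让灰色方块滑过图像,记录每个位置下正确类别的概率。这里的 `classify` 是一个假设的分类函数;为了让例子可运行,用一个只看固定区域亮度的玩具分类器代替真实convnet(示意代码,非论文实现):

```python
import numpy as np

def occlusion_map(img, classify, true_class, patch=4, value=0.5):
    """滑动 patch×patch 的遮挡方块,返回正确类别概率的热图。"""
    H, W = img.shape
    heat = np.zeros((H - patch + 1, W - patch + 1))
    for i in range(heat.shape[0]):
        for j in range(heat.shape[1]):
            occluded = img.copy()
            occluded[i:i+patch, j:j+patch] = value   # 放置灰色方块
            heat[i, j] = classify(occluded)[true_class]
    return heat

# 玩具分类器:类别 0 的“概率”= 区域 [2:6, 2:6] 的平均亮度
img = np.zeros((8, 8))
img[2:6, 2:6] = 1.0          # “物体”所在位置
classify = lambda im: np.array([im[2:6, 2:6].mean(), 1 - im[2:6, 2:6].mean()])
heat = occlusion_map(img, classify, true_class=0, value=0.0)
# 当方块恰好盖住物体区域 (2,2) 时,正确类别概率跌到最低
```

真实实验中 `classify` 即训练好的convnet的softmax输出,热图最低点对应物体位置,正如图7(d)所示。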
4.3. Correspondence Analysis
Deep models differ from many existing recognition approaches in that there is no explicit mechanism for establishing correspondence between specific object parts in different images (e.g. faces have a particular spatial configuration of the eyes and nose). However, an intriguing possibility is that deep models might be implicitly computing them. To explore this, we take 5 randomly drawn dog images with frontal pose and systematically mask out the same part of the face in each image (e.g. all left eyes, see Fig. 8). For each image i, we then compute: ε_i^l = x_i^l − x̃_i^l, where x_i^l and x̃_i^l are the feature vectors at layer l for the original and occluded images respectively. We then measure the consistency of this difference vector ε between all related image pairs (i, j): Δ_l = Σ_{i,j=1, i≠j}^{5} H(sign(ε_i^l), sign(ε_j^l)), where H is Hamming distance. A lower value indicates greater consistency in the change resulting from the masking operation, hence tighter correspondence between the same object parts in different images (i.e. blocking the left eye changes the feature representation in a consistent way). In Table 1 we compare the Δ score for three parts of the face (left eye, right eye and nose) to random parts of the object, using features from layer l = 5 and l = 7. The lower score for these parts, relative to random object regions, for the layer 5 features show the model does establish some degree of correspondence.
深度模型与许多现有识别方法的不同之处在于,它没有显式机制来建立不同图像中特定物体部件之间的对应关系(例如,脸部的眼睛和鼻子具有特定的空间构型)。然而,一个有趣的可能性是,深度模型可能在隐式地计算这种对应关系。为了探索这一点,我们随机抽取5张正面姿态的狗的图像,并在每张图像中系统地遮挡脸部的同一部分(例如,全部遮左眼,见图8)。然后,对于每张图像i,我们计算:ε_i^l = x_i^l − x̃_i^l,其中x_i^l和x̃_i^l分别是原始图像和遮挡图像在第l层的特征向量。然后,我们测量所有相关图像对(i, j)之间差异向量ε的一致性:Δ_l = Σ_{i,j=1, i≠j}^{5} H(sign(ε_i^l), sign(ε_j^l)),其中H是Hamming距离。值越低,表示遮挡操作导致的变化越一致,因此不同图像中相同物体部件之间的对应关系越紧密(即遮挡左眼会以一致的方式改变特征表示)。在表1中,我们使用第l=5层和第l=7层的特征,比较脸部三个部分(左眼、右眼和鼻子)与物体随机部分的Δ得分。相对于随机物体区域,这些部位在第5层特征上的得分更低,表明模型确实建立了一定程度的对应关系。
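第4.3节描述的一致性度量可以示意如下:对每张图像计算原始特征与遮挡后特征之差 ε_i = x_i − x̃_i,再对所有图像对 (i, j) 累加 sign(ε_i) 与 sign(ε_j) 之间的 Hamming 距离(示意代码,特征向量为假设输入,非论文实现):

```python
import numpy as np

def correspondence_score(feats_orig, feats_occl):
    """Δ = Σ_{i≠j} H(sign(ε_i), sign(ε_j)),其中 ε_i = x_i - x̃_i。
    feats_orig / feats_occl: 形状 (num_images, dim) 的特征矩阵。
    分数越低,遮挡引起的特征变化在各图像间越一致。"""
    eps = np.sign(feats_orig - feats_occl)
    n = len(eps)
    total = 0
    for i in range(n):
        for j in range(n):
            if i != j:
                total += int(np.sum(eps[i] != eps[j]))   # Hamming 距离
    return total
```

若5张图像遮挡同一部件引起的特征变化方向完全一致,则 Δ = 0;变化方向越不一致,Δ 越大——这与表1中“眼睛/鼻子的分数低于随机遮挡区域”的读法一致。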
Figure 3. Architecture of our 8 layer convnet model. A 224 by 224 crop of an image (with 3 color planes) is presented as the input. This is convolved with 96 different 1st layer filters (red), each of size 7 by 7, using a stride of 2 in both x and y. The resulting feature maps are then: (i) passed through a rectified linear function (not shown), (ii) pooled (max within 3x3 regions, using stride 2) and (iii) contrast normalized across feature maps to give 96 different 55 by 55 element feature maps. Similar operations are repeated in layers 2,3,4,5. The last two layers are fully connected, taking features from the top convolutional layer as input in vector form (6 · 6 · 256 = 9216 dimensions). The final layer is a C-way softmax function, C being the number of classes. All filters and feature maps are square in shape.
图3. 我们的8层convnet模型的体系结构。输入为图像的224×224裁剪(含3个颜色通道)。它与96个不同的第一层滤波器(红色)进行卷积,每个滤波器尺寸为7×7,x和y方向的步幅均为2。生成的特征图随后:(i)通过修正线性函数(未显示),(ii)进行最大池化(3x3区域内取最大值,步幅为2),(iii)跨特征图进行对比度归一化,得到96个不同的55×55特征图。在第2、3、4、5层中重复类似的操作。最后两层为全连接层,以向量形式接收顶部卷积层的特征作为输入(6·6·256 = 9216维)。最后一层是C路softmax函数,C为类别数。所有滤波器和特征图均为正方形。
Figure 4. Evolution of a randomly chosen subset of model features through training. Each layer’s features are displayed in a different block. Within each block, we show a randomly chosen subset of features at epochs [1,2,5,10,20,30,40,64]. The visualization shows the strongest activation (across all training examples) for a given feature map, projected down to pixel space using our deconvnet approach. Color contrast is artificially enhanced and the figure is best viewed in electronic form.
图4. 随机选择的模型特征子集在训练过程中的演变。每一层的特征显示在不同的块中。在每个块中,我们显示在epoch [1,2,5,10,20,30,40,64]时随机选择的特征子集。可视化显示了给定特征图的最强激活(在所有训练样本上),并使用我们的deconvnet方法投影到像素空间。色彩对比度经过人为增强,该图最好以电子形式查看。
Figure 5. Analysis of vertical translation, scale, and rotation invariance within the model (rows a-c respectively). Col 1: 5 example images undergoing the transformations. Col 2 & 3: Euclidean distance between feature vectors from the original and transformed images in layers 1 and 7 respectively. Col 4: the probability of the true label for each image, as the image is transformed.
图5. 模型内垂直平移、缩放和旋转不变性的分析(分别对应a-c行)。第1列:经历变换的5个示例图像。第2列和第3列:第1层和第7层中,原始图像与变换图像的特征向量之间的欧氏距离。第4列:随着图像变换,每个图像真实标签的概率。
Figure 6. (a): 1st layer features without feature scale clipping. Note that one feature dominates. (b): 1st layer features from (Krizhevsky et al., 2012). (c): Our 1st layer features. The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) results in more distinctive features and fewer “dead” features. (d): Visualizations of 2nd layer features from (Krizhevsky et al., 2012). (e): Visualizations of our 2nd layer features. These are cleaner, with no aliasing artifacts that are visible in (d).
图6. (a):未做特征尺度裁剪的第一层特征。请注意,有一个特征占主导。(b):(Krizhevsky et al., 2012)的第一层特征。(c):我们的第一层特征。较小的步幅(2对4)和滤波器尺寸(7x7对11x11)带来更多有区分度的特征和更少的“死”特征。(d):(Krizhevsky et al., 2012)第二层特征的可视化。(e):我们第二层特征的可视化。这些特征更干净,没有(d)中可见的混叠伪影。
Figure 7. Three test examples where we systematically cover up different portions of the scene with a gray square (1st column) and see how the top (layer 5) feature maps ((b) & (c)) and classifier output ((d) & (e)) changes. (b): for each position of the gray scale, we record the total activation in one layer 5 feature map (the one with the strongest response in the unoccluded image). (c): a visualization of this feature map projected down into the input image (black square), along with visualizations of this map from other images. The first row example shows the strongest feature to be the dog’s face. When this is covered-up the activity in the feature map decreases (blue area in (b)). (d): a map of correct class probability, as a function of the position of the gray square. E.g. when the dog’s face is obscured, the probability for “pomeranian” drops significantly. (e): the most probable label as a function of occluder position. E.g. in the 1st row, for most locations it is “pomeranian”, but if the dog’s face is obscured but not the ball, then it predicts “tennis ball”. In the 2nd example, text on the car is the strongest feature in layer 5, but the classifier is most sensitive to the wheel. The 3rd example contains multiple objects. The strongest feature in layer 5 picks out the faces, but the classifier is sensitive to the dog (blue region in (d)), since it uses multiple feature maps.
图7. 三个测试示例,其中我们用灰色方块(第1列)系统地遮盖场景的不同部分,并观察顶层(第5层)特征图((b)和(c))以及分类器输出((d)和(e))如何变化。(b):对于灰色方块的每个位置,我们记录一个第5层特征图(在无遮挡图像中响应最强的那个)中的总激活。(c):该特征图投影到输入图像(黑色方框)中的可视化,以及该特征图在其他图像上的可视化。第一行示例显示最强的特征是狗的脸。当它被遮住时,特征图中的活动会减少((b)中的蓝色区域)。(d):正确类别概率随灰色方块位置变化的图。例如,当狗的脸被遮住时,“博美犬(pomeranian)”的概率显著下降。(e):最可能的标签作为遮挡物位置的函数。例如,在第一行中,大多数位置的预测都是“博美犬”,但如果被遮住的是狗的脸而不是球,则预测为“网球(tennis ball)”。在第二个示例中,汽车上的文字是第5层中最强的特征,但分类器对车轮最敏感。第三个示例包含多个物体。第5层中最强的特征选中了人脸,但分类器对狗((d)中的蓝色区域)敏感,因为它使用了多个特征图。
Figure 8. Images used for correspondence experiments. Col 1: Original image. Col 2,3,4: Occlusion of the right eye, left eye, and nose respectively. Other columns show examples of random occlusions.
图8.用于对应实验的图像。第1列:原始图像。第2,3,4列:分别遮挡右眼,左眼和鼻子。其他列显示了随机遮挡的示例。
Table 1. Measure of correspondence for different object parts in 5 different dog images. The lower scores for the eyes and nose (compared to random object parts) show the model implicitly establishing some form of correspondence of parts at layer 5 in the model. At layer 7, the scores are more similar, perhaps due to upper layers trying to discriminate between the different breeds of dog.
表1. 5个不同的狗图像中不同对象部分的对应性度量。眼睛和鼻子的较低分数(与随机物体的部分相比)表明该模型隐式地在模型的第5层建立某种形式的部分对应关系。在第7层,得分更为相似,可能是由于上层试图区分不同品种的狗。
5. Experiments
略
6. Discussion
We explored large convolutional neural network models, trained for image classification, in a number of ways. First, we presented a novel way to visualize the activity within the model. This reveals the features to be far from random, uninterpretable patterns. Rather, they show many intuitively desirable properties such as compositionality, increasing invariance and class discrimination as we ascend the layers. We also showed how these visualizations can be used to debug problems with the model to obtain better results, for example improving on Krizhevsky et al.'s (Krizhevsky et al., 2012) impressive ImageNet 2012 result. We then demonstrated through a series of occlusion experiments that the model, while trained for classification, is highly sensitive to local structure in the image and is not just using broad scene context. An ablation study on the model revealed that having a minimum depth to the network, rather than any individual section, is vital to the model’s performance.
我们以多种方式探索了针对图像分类训练的大型卷积神经网络模型。首先,我们提出了一种新颖的方式来可视化模型中的活动。可视化表明这些特征远非随机的、不可解释的模式;相反,随着层数上升,它们表现出许多直观上理想的属性,例如组合性、逐渐增强的不变性和类别区分能力。我们还展示了如何利用这些可视化来调试模型中的问题以获得更好的结果,例如改进了Krizhevsky等人(Krizhevsky et al., 2012)在ImageNet 2012上令人印象深刻的结果。然后,我们通过一系列遮挡实验证明,该模型虽然是为分类而训练的,却对图像中的局部结构高度敏感,而不仅仅是利用宽泛的场景上下文。对模型的消融研究表明,网络具有一个最小深度,而非任何单个部分,对模型的性能至关重要。
Finally, we showed how the ImageNet trained model can generalize well to other datasets. For Caltech-101 and Caltech-256, the datasets are similar enough that we can beat the best reported results, in the latter case by a significant margin. This result brings into question the utility of benchmarks with small (i.e. < 10⁴) training sets. Our convnet model generalized less well to the PASCAL data, perhaps suffering from dataset bias (Torralba & Efros, 2011), although it was still within 3.2% of the best reported result, despite no tuning for the task. For example, our performance might improve if a different loss function was used that permitted multiple objects per image. This would naturally enable the networks to tackle the object detection as well.
最后,我们展示了在ImageNet上训练的模型如何很好地推广到其他数据集。对于Caltech-101和Caltech-256,数据集足够相似,我们能够击败已报告的最佳结果,在后一种情况下还领先很多。这一结果使得小训练集(即少于10⁴)基准的效用受到质疑。我们的convnet模型对PASCAL数据的泛化效果较差,可能受到数据集偏差的影响(Torralba & Efros, 2011),不过尽管未针对该任务进行任何调整,其结果与已报告的最佳结果的差距仍在3.2%以内。例如,如果使用允许每张图像包含多个物体的不同损失函数,我们的性能可能会提高。这自然也会使网络能够处理物体检测任务。