Classic Neural Network Paper Reading | GoogLeNet (Inception V1)

Christian Szegedy, Wei Liu, Yangqing Jia, et al. "Going Deeper with Convolutions", CVPR 2015

An excellent paper.

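The Introduction already points out that GoogLeNet has only about 5 million parameters, 12 times fewer than AlexNet, while being more accurate. It argues that accuracy should not be pursued blindly; the efficiency of deploying the model on real devices matters as well.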

Inspirations: "Network in Network" (1x1 convolutions, and Global Average Pooling in place of fully connected layers), and "Provable Bounds for Learning Some Deep Representations" (replacing large, dense, bloated networks with sparse ones).

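Related Work first goes back to the sources of inspiration. Although pooling layers lose precise spatial information, the same kind of network can still be used for localization and object detection:
"OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks"
"Robust Object Recognition with Cortex-Like Mechanisms" handles multi-scale input with filters of different sizes, similar to the Inception module.
Next comes object detection: "Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation" (R-CNN) first finds region proposals, then runs a CNN on each proposal to identify its class.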
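Motivation and High Level Considerations: the goal is to increase both depth and width. Building on existing models, the Inception module keeps the sparsity that reduces parameters while still exploiting the high computational performance of dense matrix operations. Used as the base model for R-CNN, it is also very useful for localization and object detection.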

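Architectural Details: the design idea is to approximate the optimal local sparse structure with dense building blocks. Layers near the input extract local information and layers further back extract information over larger regions, so an Inception module sitting between two layers needs both small and large receptive fields.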
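Why 1x1 convolutions: the idea is inspired by embeddings, where a low-dimensional dense vector stands in for a high-dimensional sparse one. At the same time, the representation should not become too dense, since compressed information is harder to process, so the 1x1 convolution is applied as a reduction right before the expensive 3x3 and 5x5 convolutions. It lowers the channel dimension and cuts computation at the same time.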
Visual information at various scales is processed simultaneously and then aggregated.
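
To make the saving from the 1x1 reduction concrete, here is a rough multiply-accumulate count for the 5x5 branch of one Inception module. The sizes (a 28x28x192 input, a 16-channel 1x1 reduction, 32 output channels) are the ones I read off the paper's Table 1 for inception (3a). The helper `conv_macs` is my own name, and biases, ReLUs and padding details are ignored, so treat this as a back-of-the-envelope sketch rather than an exact cost model.

```python
def conv_macs(height, width, c_in, c_out, kernel):
    """Multiply-accumulates of a stride-1 kernel x kernel convolution
    that keeps the spatial size (biases ignored)."""
    return height * width * c_in * c_out * kernel * kernel


# Assumed sizes: inception (3a) as listed in Table 1 of the paper.
H = W = 28
C_IN = 192

# 5x5 convolution applied directly to all 192 input channels.
direct = conv_macs(H, W, C_IN, 32, 5)

# 1x1 reduction to 16 channels first, then the 5x5 convolution.
reduced = conv_macs(H, W, C_IN, 16, 1) + conv_macs(H, W, 16, 32, 5)

print(direct, reduced, round(direct / reduced, 1))
```

Under these assumptions the reduced branch needs a bit more than a tenth of the work of the direct 5x5 convolution, which is the point of putting the 1x1 layer in front.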
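GoogLeNet: replacing the fully connected layers with GAP has two benefits. First, it makes fine-tuning for transfer learning easier; second, it improved top-1 accuracy by about 0.6%. Features from shallower layers are already fairly discriminative, so the authors attach auxiliary classifiers to the outputs of Inception (4a) and (4d) and compute two auxiliary losses. The auxiliary classifiers are removed at test time.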

$$L = L_{\text{final}} + 0.3\,L_{\text{aux1}} + 0.3\,L_{\text{aux2}}$$
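
The weighted sum above is simple enough to write down directly. A minimal sketch, where the function name and the example loss values are made up for illustration and only the 0.3 weight comes from the notes:

```python
def googlenet_training_loss(final_loss, aux_losses, aux_weight=0.3):
    """Training loss: the main classifier's loss plus the auxiliary
    classifiers' losses, each scaled by aux_weight."""
    return final_loss + aux_weight * sum(aux_losses)


# Hypothetical per-batch loss values, for illustration only.
total = googlenet_training_loss(1.2, [1.5, 1.4])
print(total)
```

At test time the auxiliary branches are gone, so only the main classifier's output is used.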

Training Methodology: asynchronous stochastic gradient descent (asynchronous because training is data-parallel).
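Hyperparameter tuning is a bit of a black art, especially dropout and the learning rate. Data augmentation: sample crops covering 8%-100% of the original image with aspect ratio between 3/4 and 4/3, apply photometric distortions ("Some Improvements on Deep Convolutional Neural Network Based Image Classification"), and pick the interpolation method from bilinear, area, nearest neighbor and cubic with equal probability.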
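
The crop sampling can be sketched as follows. The notes only give the ranges, so the uniform distributions, the retry count and the full-image fallback are all my assumptions, and `sample_crop` and `sample_interpolation` are hypothetical helper names rather than anything from the paper.

```python
import math
import random

INTERPOLATIONS = ("bilinear", "area", "nearest", "cubic")


def sample_crop(width, height, rng, area_range=(0.08, 1.0),
                ratio_range=(3 / 4, 4 / 3), max_tries=10):
    """Return (x, y, w, h) of a random crop inside a width x height image."""
    for _ in range(max_tries):
        # Assumption: area fraction and aspect ratio are drawn uniformly.
        area = rng.uniform(*area_range) * width * height
        ratio = rng.uniform(*ratio_range)  # ratio = w / h
        w = int(round(math.sqrt(area * ratio)))
        h = int(round(math.sqrt(area / ratio)))
        if 0 < w <= width and 0 < h <= height:
            x = rng.randint(0, width - w)
            y = rng.randint(0, height - h)
            return x, y, w, h
    # Assumption: fall back to the whole image if no sample fits.
    return 0, 0, width, height


def sample_interpolation(rng):
    """Pick one of the four resize methods with equal probability."""
    return rng.choice(INTERPOLATIONS)


rng = random.Random(0)
print(sample_crop(640, 480, rng), sample_interpolation(rng))
```

The sampled crop would then be resized to the network's input size with the chosen interpolation method; that step needs an image library and is left out here.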

Classification, object detection, transfer learning

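The feature maps that feed GAP, weighted by the weights of the final FC layer, can be turned into a Class Activation Map for localizing the regions that matter for a prediction.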
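
A minimal sketch of the Class Activation Map idea, assuming the final FC layer acts directly on the GAP output and has no bias. The tiny feature maps and weights are made-up toy values, and the function names are my own.

```python
def global_average_pool(feature_maps):
    """Average each H x W feature map down to one number per channel."""
    return [sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))
            for fmap in feature_maps]


def class_activation_map(feature_maps, fc_weights, class_idx):
    """Weight each channel's feature map by the FC weight that connects
    that channel to class_idx, then sum over channels."""
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    cam = [[0.0] * width for _ in range(height)]
    for weight, fmap in zip(fc_weights[class_idx], feature_maps):
        for i in range(height):
            for j in range(width):
                cam[i][j] += weight * fmap[i][j]
    return cam


# Toy input (made up): 2 channels of size 2x2, 2 classes.
feature_maps = [[[1.0, 0.0], [0.0, 0.0]],
                [[0.0, 0.0], [0.0, 2.0]]]
fc_weights = [[1.0, 0.5],
              [-1.0, 2.0]]

cam = class_activation_map(feature_maps, fc_weights, class_idx=0)
print(cam)
```

Because GAP and the FC layer are both linear, the spatial mean of the map equals the class score computed from the pooled features, which is why large values in the map point at the locations that contributed most to that class.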

Our results seem to yield a solid evidence that approximating the expected optimal sparse structure by readily available dense building blocks is a viable method for improving neural networks for computer vision.
The main advantage of this method is a significant quality gain at a modest increase of computational requirements compared to shallower and less wide networks.
In the remainder of this paper, we analyze the multiple factors that contribute to the overall performance of the final submission.
A schematic view of the resulting network is depicted in Figure 3 (a useful phrase for introducing a figure)
Their main result states that xx

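Classification: each image is cropped and mirrored into 144 patches, and the 144 softmax outputs are averaged to get the predicted class. An ensemble of 7 trained models is used. Compared with the base, top-5 error drops by 3.45%.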
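
Where the 144 comes from: as I read the paper, each image is resized to 4 scales, 3 squares are taken per scale, 6 views per square (4 corners, center, and the whole square resized), each with and without mirroring. The sketch below just enumerates those combinations and averages per-crop softmax vectors; the label strings and the example probabilities are placeholders of my own.

```python
from itertools import product

SCALES = (256, 288, 320, 352)          # shorter side after resizing
SQUARES = ("left", "center", "right")  # top/center/bottom for portraits
VIEWS = ("top_left", "top_right", "bottom_left", "bottom_right",
         "center", "resized_square")
MIRRORS = (False, True)


def crop_specs():
    """Enumerate every (scale, square, view, mirrored) combination."""
    return list(product(SCALES, SQUARES, VIEWS, MIRRORS))


def average_predictions(prob_vectors):
    """Average softmax vectors over crops; return (mean vector, argmax)."""
    n = len(prob_vectors)
    num_classes = len(prob_vectors[0])
    mean = [sum(p[c] for p in prob_vectors) / n for c in range(num_classes)]
    return mean, max(range(num_classes), key=mean.__getitem__)


print(len(crop_specs()))
# Placeholder softmax outputs for two crops and two classes.
print(average_predictions([[0.6, 0.4], [0.2, 0.8]]))
```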
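Object detection: a prediction counts as correct if the class is right and the predicted box overlaps the ground-truth box with an intersection over union (Jaccard index) of at least 0.5. All predictions for a class can be tallied into a confusion matrix, and the algorithm is evaluated with mAP: for each class, AP is the area under the precision-recall curve traced out over different score thresholds, and mAP is the mean of AP over classes.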
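Region proposals come from Selective Search ("Segmentation as Selective Search for Object Recognition"), combined with multi-box predictions ("Scalable Object Detection using Deep Neural Networks") to cut down on useless candidate boxes.
Bounding box regression was not used to refine the candidate boxes.
An ensemble of 6 ConvNets is used as the classifier.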
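
The matching rule and the AP definition from the detection notes can be sketched like this. Boxes are assumed to be `(x1, y1, x2, y2)` corner tuples, and the AP here is the plain uninterpolated area under the precision-recall curve; official benchmark toolkits use their own variants, so this is an illustration of the idea rather than the exact ILSVRC scoring code.

```python
def iou(box_a, box_b):
    """Jaccard index of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def is_correct(pred_box, pred_label, gt_box, gt_label, threshold=0.5):
    """Correct if the class matches and the boxes overlap enough."""
    return pred_label == gt_label and iou(pred_box, gt_box) >= threshold


def average_precision(scored_hits, num_gt):
    """scored_hits: (score, is_true_positive) per detection of one class.
    Each true positive raises recall by 1 / num_gt, so summing the
    precision at those ranks gives the area under the PR curve."""
    ranked = sorted(scored_hits, key=lambda item: item[0], reverse=True)
    true_positives = 0
    total = 0.0
    for rank, (_, hit) in enumerate(ranked, start=1):
        if hit:
            true_positives += 1
            total += true_positives / rank
    return total / num_gt if num_gt else 0.0


def mean_average_precision(ap_per_class):
    """mAP: the mean of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)


print(iou((0, 0, 2, 2), (1, 0, 3, 2)))
print(average_precision([(0.9, True), (0.8, False), (0.7, True)], num_gt=2))
```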

Pros:

Cons:

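Inception and DenseNet both use the Concat operation: features are concatenated and the network goes on to learn how to fuse them. No information is lost in the process, but the computation grows.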

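By contrast, ResNet uses the Add operation: it is cheap and yields new features that reflect some properties of the original ones, but some information is inevitably lost.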
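
A toy illustration of the difference, treating each feature tensor as a flat list of per-channel values (a simplification of my own; real tensors also have spatial dimensions). The branch widths 64, 128, 32 and 32 are the inception (3a) output widths as I read them from the paper's Table 1.

```python
def concat_channels(*branches):
    """Concat: keep every branch's channels side by side."""
    merged = []
    for branch in branches:
        merged.extend(branch)
    return merged


def add_features(x, y):
    """Add: element-wise sum, so both inputs need the same channel count."""
    if len(x) != len(y):
        raise ValueError("Add requires matching channel counts")
    return [a + b for a, b in zip(x, y)]


# Concat grows the channel count: 64 + 128 + 32 + 32 = 256.
branches = [[0.0] * n for n in (64, 128, 32, 32)]
print(len(concat_channels(*branches)))

# Add keeps the channel count, but different inputs can give the same sum.
print(add_features([1.0, 2.0], [3.0, 4.0]))
print(add_features([2.0, 3.0], [2.0, 3.0]))
```

The last two calls return the same list from different inputs, which is the sense in which Add cannot be undone, while the concatenated output still contains every input value and the next layer pays for the extra channels.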