All screenshots are taken from the original paper; if there is any infringement, please contact me for removal.

1. Key Features of the Paper

Efficiently produces task-specific models: a single NAT run can effectively obtain neural networks for multiple tasks.

  • Train a task-specific supernet.

  • Sample task-specific subnets from the supernet without any additional training.

Extensive experiments show:

Transfer learning from a model pre-trained on ImageNet usually yields a better model than training directly on a small dataset.

The key ingredients are:

  • an integrated online transfer learning procedure
  • a many-objective evolutionary search procedure

The pre-trained supernet is iteratively adapted while searching for task-specific subnets.

NAT returns two outputs:

  1. subnets tailored to the different tasks
  2. the adapted supernet

Training consists of repeatedly cycling through two stages:

1. Adapt the supernet

  • First, a layer-wise empirical distribution is constructed from the best subnets returned by the evolutionary search.
  • Then, subnets sampled from that distribution are fine-tuned.

2. Search stage

  • A surrogate model is used to quickly predict the objectives of any sampled subnet, avoiding a full and expensive evaluation.
  • The surrogate itself is also learned online from previously evaluated subnets.

NAS paper notes: Neural Architecture Transfer

2. The Proposed Method

1. Three key components:

  1. an accuracy predictor,
  2. an evolutionary search routine,
  3. a supernet.

2. Algorithm flow:

At the start, an archive A of architectures (subnets) is initialized by randomly sampling subnets from the supernet; the subnets' weights are inherited directly from the supernet.

Then the following two steps are repeated:

  1. After each round of evolutionary search, promising subnets are added to A.
  2. The supernet weights of the top-ranked subnets in A are fine-tuned.

Output: the archive and the task-specific supernet.
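The loop above can be sketched as a toy simulation. Everything here — the surrogate score, the population sizes, and the stand-in for supernet fine-tuning — is illustrative only, not the paper's implementation:

```python
import random

random.seed(0)

def random_subnet(length=22):
    # A "subnet" is abstracted as an integer string (see the encoding section).
    return [random.randint(0, 2) for _ in range(length)]

def surrogate_score(subnet):
    # Hypothetical stand-in for the accuracy predictor.
    return sum(subnet)

def nat_loop(iterations=3, archive_size=8, top_k=4):
    # Initialize the archive A by random sampling; in NAT the subnet
    # weights are inherited directly from the supernet.
    archive = [random_subnet() for _ in range(archive_size)]
    for _ in range(iterations):
        # 1) evolutionary search proposes promising subnets -> added to A
        promising = sorted((random_subnet() for _ in range(16)),
                           key=surrogate_score, reverse=True)[:2]
        archive.extend(promising)
        # 2) in NAT the supernet weights shared with the top-ranked members
        #    of A are fine-tuned; here we only keep the top-ranked members.
        archive = sorted(archive, key=surrogate_score, reverse=True)[:archive_size]
    return archive[:top_k]

best = nat_loop()
```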


3. Search Space

Each network consists of five stages; each stage has two to four layers, and each layer is a residual block.


The searchable choices include:

  • the input image resolution (R),
  • the width multiplier (W),
  • the number of layers in each stage,
  • the expansion ratio (E) of the output channels of the first 1 × 1 convolution,
  • the kernel size (K) of the depthwise separable convolution in each layer.

The size of the search space is $ 3.5 × 10^{19} $.

4. Network Encoding

To encode these architectural choices, an integer string of length 22 is used.


The first two values represent

  • the input image resolution
  • width multiplier

The remaining 20 values denote

  • the expansion ratio, candidate values: [3, 4, 6]
  • the kernel-size settings for each of the 20 layers, candidate values: [3, 5, 7]
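A possible decoding of the 22-integer string. The resolution and width-multiplier candidate lists, and the joint per-layer (E, K) index, are my assumptions for illustration; only the [3, 4, 6] and [3, 5, 7] candidates come from the notes:

```python
# Illustrative candidate tables; only EXPANSION and KERNEL come from
# the notes — RESOLUTIONS and WIDTH_MULTS are assumed for the example.
RESOLUTIONS = [192, 224, 256]
WIDTH_MULTS = [1.0, 1.2]
EXPANSION = [3, 4, 6]   # expansion ratio E candidates
KERNEL = [3, 5, 7]      # kernel size K candidates

def decode(string):
    """Decode a length-22 integer string into architecture choices.
    Assumes each of the last 20 values jointly indexes one layer's
    (E, K) pair as v = e_index * 3 + k_index."""
    assert len(string) == 22
    arch = {"resolution": RESOLUTIONS[string[0]],
            "width_mult": WIDTH_MULTS[string[1]],
            "layers": []}
    for v in string[2:]:
        e, k = divmod(v, 3)
        arch["layers"].append({"expansion": EXPANSION[e], "kernel": KERNEL[k]})
    return arch

example = decode([1, 0] + [4] * 20)
```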

5. Accuracy Predictor

Shortcomings of previous prediction approaches:

  1. PNAS [21] uses 1,160 subnets to build the surrogate but only achieves a rank-order correlation of 0.476.
  2. Once-For-All [28] uses 16,000 subnets to build the surrogate.

Reason:

  • The poor sample complexity and rank-order correlation of these approaches is due to the offline learning of the surrogate model. Instead of focusing on models at the trade-off front of the objectives, these surrogate models are built for the entire search space.

Solution:

  • We overcome the aforementioned limitation by restricting the surrogate model to the search space that constitutes the current objective trade-off.
  • Such a solution significantly reduces the sample complexity of the surrogate and increases the reliability of its predictions.

Proposed method: an RBF ensemble

  1. predict the performance of a sampled subnet without performing training or inference.
  2. decouples the evaluation of an architecture from data
  3. the evaluation time reduces from hours/minutes to seconds.


  • To further improve the RBF's performance, especially in a high sample-efficiency regime, an ensemble of RBF models is constructed. As outlined in Algorithm 2, each RBF model is constructed with a subset of samples and features randomly selected from the training instances.
  • The RBF ensemble can be learned in under a minute.
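A minimal NumPy sketch of such an ensemble: each member model sees only a random subset of samples and features, and predictions are averaged. The Gaussian kernel, subset fractions, and kernel width are illustrative choices, not the paper's Algorithm 2 settings:

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_rbf(X, y, eps=1.0):
    # Gaussian RBF interpolant: solve (K + lambda*I) w = y for the weights.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    K = np.exp(-(eps * d) ** 2) + 1e-8 * np.eye(len(X))
    return X, np.linalg.solve(K, y)

def predict_rbf(model, Xq, eps=1.0):
    centers, w = model
    d = np.linalg.norm(Xq[:, None, :] - centers[None, :, :], axis=-1)
    return np.exp(-(eps * d) ** 2) @ w

def fit_ensemble(X, y, n_models=10, sample_frac=0.8, feat_frac=0.8):
    # Each member is trained on a random subset of samples AND features.
    n, p = X.shape
    models = []
    for _ in range(n_models):
        rows = rng.choice(n, size=max(2, int(sample_frac * n)), replace=False)
        cols = rng.choice(p, size=max(1, int(feat_frac * p)), replace=False)
        models.append((cols, fit_rbf(X[np.ix_(rows, cols)], y[rows])))
    return models

def predict_ensemble(models, Xq):
    # Average the member predictions.
    return np.mean([predict_rbf(m, Xq[:, cols]) for cols, m in models], axis=0)

# Toy data: "architectures" as 5-dim feature vectors, score = feature sum.
X = rng.random((30, 5))
y = X.sum(axis=1)
models = fit_ensemble(X, y)
pred = predict_ensemble(models, X[:5])
```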

6. Evolutionary Search

  • An evolutionary algorithm is an iterative process: an initially random set of sampled architectures is gradually improved as a group, referred to as a population.
  • Each batch of offspring networks is produced by applying mutation and crossover to promising parents.
  • Every member (parent or offspring) competes for survival and reproduction.

There are two main operators:

Crossover: exchanges information between two or more population members to create two or more new members.

  • The paper selects parents for mating via tournament selection.
  • Each crossover produces two offspring architectures; every generation produces an offspring population the same size as the parent population.

Mutation: a local operator that perturbs a solution to produce a new solution in its neighborhood.

  • A discrete version of the polynomial mutation (PM) operator [58] is used, applied to each offspring created by the crossover operator.


The overall evolutionary search flow is shown as a flowchart in the original paper.
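The two operators can be illustrated on integer strings. Uniform crossover and a simple random-reset mutation are simplified stand-ins here, not the paper's actual crossover or the discrete polynomial mutation [58]:

```python
import random

random.seed(1)

def uniform_crossover(p1, p2):
    # Exchange information position-by-position to create two offspring.
    c1, c2 = [], []
    for a, b in zip(p1, p2):
        if random.random() < 0.5:
            a, b = b, a
        c1.append(a)
        c2.append(b)
    return c1, c2

def discrete_mutation(child, choices=(0, 1, 2), prob=0.1):
    # Local perturbation: each position is resampled with small probability.
    return [random.choice(choices) if random.random() < prob else v
            for v in child]

parents = ([0] * 22, [2] * 22)
c1, c2 = uniform_crossover(*parents)
c1 = discrete_mutation(c1)
```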

7. Selection

The paper adopts the reference-point-guided selection originally proposed in NSGA-III.

(I did not fully follow the details here; the original NSGA-III paper is worth reading when time permits.)

8. Supernet Adaptation

When training the supernet, only the subnets suggested by the evolutionary search are trained, rather than training all subnets or giving every subnet an equal chance of being trained.

The authors' reasoning:

  • Firstly, not all subnets are equally important for the task at hand.
  • Secondly, only a tiny fraction of the search space can practically be explored by a NAS algorithm.

Specifically, NAT seeks to exploit the knowledge gained from the search process so far.

Procedure:

  1. We construct a categorical distribution from the architectures in the archive, where the probability of the i-th integer taking the value j is computed as:

    $$ p_i(j) = \frac{1}{|\mathcal{A}|} \sum_{a \in \mathcal{A}} \mathbb{1}(a_i = j) $$

    i.e., the empirical frequency with which value j occurs at position i across the archive $\mathcal{A}$.

  2. In each training step (batch of data) we sample an integer-string from the above distribution.

  3. We then activate the subparts of the supernet corresponding to the architecture decoded from the integer string.

  4. Only weights corresponding to the activated subparts of the supernet are updated in each step.
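Steps 1–2 can be sketched as follows: a per-position frequency distribution built from the archive, sampled once per training step. The supernet weight update itself is omitted:

```python
import random
from collections import Counter

random.seed(2)

def archive_distribution(archive):
    # For each position i, p_i(j) = frequency of value j across the archive.
    dist = []
    for i in range(len(archive[0])):
        counts = Counter(a[i] for a in archive)
        total = sum(counts.values())
        dist.append({j: c / total for j, c in counts.items()})
    return dist

def sample_architecture(dist):
    # Draw one integer string, one position at a time.
    return [random.choices(list(p), weights=list(p.values()))[0] for p in dist]

# Toy archive of three length-3 "architectures".
archive = [[0, 1, 2], [0, 2, 2], [0, 1, 1]]
dist = archive_distribution(archive)
arch = sample_architecture(dist)
```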
