All screenshots are taken from the original paper; in case of any infringement, please contact me for removal.
1. Key Features of the Paper
Efficiently produces task-specific models: a single run of NAT yields neural networks for multiple tasks.
- Train a task-specific supernet.
- Sample task-specific subnets from the supernet without any additional training.
Extensive experiments show that transfer learning from models pre-trained on ImageNet typically outperforms models trained directly on small datasets.
The keys are:
- an integrated online transfer learning procedure
- a multi-objective evolutionary search procedure
While searching for task-specific subnets, the pre-trained supernet is iteratively adapted.
NAT returns two outputs:
- subnets tailored to the different tasks
- the adapted supernet
Training consists of repeatedly cycling through two stages:
1. Adapt the supernet
- First, construct a layer-wise empirical distribution from the best subnets returned by the evolutionary search.
- Then, fine-tune subnets sampled from this distribution.
2. Search stage
- A surrogate model is used to quickly predict the objectives of any sampled subnet, avoiding a full and expensive evaluation.
- The predictor itself is also learned online from previously evaluated subnets.
2. Proposed Method
1. Three key components:
- an accuracy predictor
- an evolutionary search routine
- a supernet
2. Algorithm flow:
At the start, an archive A of architectures (subnets) is initialized by randomly sampling subnets from the supernet; the subnet weights are inherited directly from the supernet.
Then the following two steps are repeated:
- After each evolutionary search round, promising subnets are added to A.
- The supernet weights corresponding to the top-ranked subnets in A are fine-tuned.
Output: the archive and the task-specific supernet.
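The flow above can be sketched as a short loop. All function names, signatures, and the archive representation here are illustrative placeholders, not the paper's actual API:

```python
def nat_loop(supernet, sample_subnet, evolve, finetune, score,
             n_init=100, n_iters=30, topk=50):
    """Sketch of NAT's outer loop; callbacks are hypothetical stand-ins."""
    # Seed the archive A with randomly sampled subnets whose weights are
    # inherited directly from the supernet (no training from scratch).
    archive = [sample_subnet(supernet) for _ in range(n_init)]
    for _ in range(n_iters):
        # Search stage: evolutionary search returns promising subnets,
        # which are appended to the archive.
        archive += evolve(archive)
        # Adapt stage: fine-tune only the supernet weights used by the
        # top-ranked subnets in the archive.
        top = sorted(archive, key=score, reverse=True)[:topk]
        finetune(supernet, top)
    return archive, supernet
```

The loop makes the two-stage alternation explicit: search enlarges the archive, and adaptation concentrates supernet training on the archive's best members.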
3. Search space:
Each network consists of five stages; each stage has two to four layers, and each layer is a residual block.
The searched dimensions include:
- the input image resolution (R)
- the width multiplier (W)
- the number of layers in each stage
- the expansion ratio (E) of the output channels of the first 1 × 1 convolution
- the kernel size (K) of the depthwise separable convolution in each layer
The size of the search space is $3.5 \times 10^{19}$.
4. Network encoding
To encode these architectural choices, an integer string of length 22 is used.
The first two values represent:
- the input image resolution
- the width multiplier
The remaining 20 values denote, for each of the 20 layers:
- the expansion ratio (candidate values: [3, 4, 6])
- the kernel size (candidate values: [3, 5, 7])
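A minimal decoder for this 22-integer string might look as follows. The resolution and width-multiplier option lists are hypothetical (the notes don't give them), and the assumption that each layer gene jointly indexes an (E, K) pair, with 0 marking a skipped layer, is my reading rather than the paper's exact scheme:

```python
from itertools import product

# Candidate values stated in the notes.
EXPANSION = [3, 4, 6]   # expansion ratio E of the first 1x1 conv
KERNEL = [3, 5, 7]      # kernel size K of the depthwise separable conv
# Hypothetical option lists for the first two genes (not given in the notes).
RESOLUTIONS = [192, 208, 224, 240, 256]
WIDTH_MULTS = [1.0, 1.2]

# Joint (E, K) codebook: value 0 = layer skipped, 1..9 = one of the 3x3 combos.
LAYER_CODES = {i + 1: ek for i, ek in enumerate(product(EXPANSION, KERNEL))}

def decode(genome):
    """Decode a 22-integer string into an architecture description."""
    assert len(genome) == 22
    arch = {
        "resolution": RESOLUTIONS[genome[0]],
        "width_mult": WIDTH_MULTS[genome[1]],
        "layers": [],
    }
    for g in genome[2:]:
        if g == 0:
            arch["layers"].append(None)  # layer not used in this subnet
        else:
            e, k = LAYER_CODES[g]
            arch["layers"].append({"expansion": e, "kernel": k})
    return arch
```

A skip code of this kind is one way the fixed-length string can express the variable two-to-four layers per stage.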
5. Accuracy Predictor
Drawbacks of previous prediction methods:
- PNAS [21] uses 1,160 subnets to build the surrogate but achieves a rank-order correlation of only 0.476.
- Once-For-All [28] uses 16,000 subnets to build the surrogate.
Reason:
- The poor sample complexity and rank-order correlation of these approaches is due to offline learning of the surrogate model: instead of focusing on models at the trade-off front of the objectives, these surrogates are built for the entire search space.
Solution:
- The limitation above is overcome by restricting the surrogate model to the part of the search space that constitutes the current objective trade-off.
- This significantly reduces the surrogate's sample complexity and increases the reliability of its predictions.
Proposed method: RBF ensemble
- Predicts the performance of a sampled subnet without any training or inference.
- Decouples the evaluation of an architecture from the data.
- Reduces evaluation time from hours/minutes to seconds.
- To further improve the RBF's performance, especially in the high-sample-efficiency regime, an ensemble of RBF models is constructed. As outlined in Algorithm 2, each RBF model is constructed from a subset of samples and features randomly selected from the training instances.
- The RBF ensemble can be learned in under a minute.
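A minimal sketch of such an RBF ensemble, assuming a Gaussian kernel and a small ridge term in the solve (details that the paper's Algorithm 2 may handle differently):

```python
import numpy as np

class RBFEnsemble:
    """Sketch of an RBF-ensemble accuracy predictor. The Gaussian kernel,
    gamma, and ridge term are assumptions, not the paper's exact settings."""

    def __init__(self, n_models=10, sample_frac=0.8, feat_frac=0.8,
                 gamma=1.0, ridge=1e-2, seed=0):
        self.n_models = n_models
        self.sample_frac = sample_frac
        self.feat_frac = feat_frac
        self.gamma = gamma
        self.ridge = ridge
        self.rng = np.random.default_rng(seed)
        self.models = []

    def _kernel(self, X, centers):
        # Gaussian RBF between rows of X and a member model's centers.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-self.gamma * d2)

    def fit(self, X, y):
        n, d = X.shape
        for _ in range(self.n_models):
            # Each member sees a random subset of samples and features,
            # as the notes describe for Algorithm 2.
            rows = self.rng.choice(n, max(2, int(self.sample_frac * n)), replace=False)
            cols = self.rng.choice(d, max(1, int(self.feat_frac * d)), replace=False)
            C = X[np.ix_(rows, cols)]
            K = self._kernel(C, C)
            # Ridge-regularized solve for numerical stability (assumption).
            w = np.linalg.solve(K + self.ridge * np.eye(len(rows)), y[rows])
            self.models.append((C, cols, w))
        return self

    def predict(self, X):
        # Ensemble prediction: average over the member RBF models.
        preds = [self._kernel(X[:, cols], C) @ w for C, cols, w in self.models]
        return np.mean(preds, axis=0)
```

Because prediction is a few matrix products, scoring a candidate subnet takes milliseconds, which is what decouples architecture evaluation from data and training.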
6. Evolutionary Search
- An evolutionary algorithm is an iterative process that gradually improves an initial set of randomly sampled architectures into a group, referred to as a population.
- Each batch of offspring networks is produced by applying mutation and crossover to promising parent networks.
- Every member (parent or offspring) competes for survival and reproduction.
There are two main operators:
Crossover: exchanges information between two or more population members to create two or more new members.
- Parents are paired for mating via binary tournament selection.
- Each crossover produces two offspring architectures, and each generation produces an offspring population of the same size as the parent population.
Mutation: a local operator that perturbs a solution to produce a new solution in its neighborhood.
- A discrete version of the polynomial mutation (PM) operator [58] is used and applied to every offspring created by the crossover operator.
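The two operators can be sketched on integer strings as follows. The uniform crossover and the resampling mutation here are simplified stand-ins: the actual discrete PM operator [58] biases new values toward the current one rather than resampling uniformly:

```python
import random

def crossover(p1, p2, rng=random):
    """Uniform crossover: each gene position is swapped with probability 0.5,
    producing two offspring (a simplification of the paper's operator)."""
    c1, c2 = list(p1), list(p2)
    for i in range(len(c1)):
        if rng.random() < 0.5:
            c1[i], c2[i] = c2[i], c1[i]
    return c1, c2

def mutate(genome, choices, pm=0.1, rng=random):
    """Per-gene mutation: with probability pm, resample the gene from its
    candidate list -- a simplified stand-in for discrete PM [58]."""
    out = list(genome)
    for i, options in enumerate(choices):
        if rng.random() < pm:
            out[i] = rng.choice([v for v in options if v != out[i]])
    return out
```

Crossover recombines genes already present in the parents, while mutation is the only operator that introduces values absent from both, which is why it is applied to every crossover offspring.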
Evolutionary algorithm workflow: (figure in the original paper)
7. Selection
The reference-point-guided selection originally proposed in NSGA-III is adopted.
I didn't fully understand the details; worth revisiting the original NSGA-III paper when time allows.
8. Adapt supernet
When training the supernet, only the subnets suggested by the evolutionary search are trained, rather than training all subnets or giving every subnet an equal chance of being trained.
The authors' reasoning:
- Firstly, not all subnets are equally important for the task at hand.
- Secondly, only a tiny fraction of the search space can practically be explored by a NAS algorithm.
Specifically, the knowledge gained from the search process so far is exploited.
Algorithm flow:
A categorical distribution is constructed from the architectures in the archive, where the probability of the i-th integer taking the value j is computed as: (equation in the original paper)
In each training step (batch of data), an integer string is sampled from this distribution.
The sub-parts of the supernet corresponding to the architecture decoded from the integer string are then activated.
Only the weights of the activated sub-parts of the supernet are updated in each step.
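Under the (hedged) reading that the categorical distribution is the empirical frequency of each value at each position across the archive, the sampling step can be sketched as:

```python
import random
from collections import Counter

def archive_distribution(archive, num_genes, options):
    """Per-position categorical distribution over archive genomes: the
    probability of value j at position i is its empirical frequency.
    This is my reading of the paper's equation, not a verbatim copy."""
    dist = []
    for i in range(num_genes):
        counts = Counter(g[i] for g in archive)
        total = sum(counts.values())
        dist.append({v: counts.get(v, 0) / total for v in options[i]})
    return dist

def sample_integer_string(dist, rng=random):
    """Sample one integer string; decoding it selects which sub-parts of
    the supernet are activated and updated in the training step."""
    return [rng.choices(list(p), weights=list(p.values()))[0] for p in dist]
```

Sampling from archive frequencies biases each training step toward architectural choices that the search has already found promising, rather than training all subnets uniformly.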