All screenshots are taken from the original paper; if there is any infringement, please contact me for removal.

1. Key Features of the Paper

Efficiently produces task-specific models: a single NAT run can effectively obtain neural networks for multiple tasks.

  • Train a task-specific supernet.

  • Sample task-specific subnets from the supernet without any additional training.

Extensive experiments show:

Transfer learning from a model pre-trained on ImageNet usually yields a better model than training directly on a small dataset.

The key ingredients are:

  • an integrated online transfer learning procedure
  • a many-objective evolutionary search procedure

The pre-trained supernet is iteratively adapted while searching for task-specific subnets.

NAT returns two outputs:

  1. subnets tailored to the different tasks
  2. the adapted supernet

Training consists of repeatedly cycling through two stages:

1. Adapt the supernet

  • First, a layer-wise empirical distribution is constructed from the best subnets returned by the evolutionary search.
  • Then, subnets sampled from that distribution are fine-tuned.

2. Search stage

  • A surrogate model is used to quickly predict the objectives of any sampled subnet, avoiding a full and expensive evaluation.
  • The surrogate itself is also learned online from previously evaluated subnets.

NAS paper notes: Neural Architecture Transfer

2. The Proposed Method

1. Three key components:

  1. an accuracy predictor,
  2. an evolutionary search routine,
  3. a supernet.

2. Algorithm flow:

At the start, an archive A of architectures (subnets) is initialized by randomly sampling subnets from the supernet; the subnets' weights are inherited directly from the supernet.

Then the following two steps are repeated:

  1. After each round of evolutionary search, promising subnets are added to A.
  2. The supernet weights of the top-ranked subnets in A are fine-tuned.

Output: the archive and the task-specific supernet.
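The loop above can be sketched as a toy simulation. Everything here — the surrogate score, the population sizes, and the stand-in for supernet fine-tuning — is illustrative only, not the paper's implementation:

```python
import random

random.seed(0)

def random_subnet(length=22):
    # A "subnet" is abstracted as an integer string (see the encoding section).
    return [random.randint(0, 2) for _ in range(length)]

def surrogate_score(subnet):
    # Hypothetical stand-in for the accuracy predictor.
    return sum(subnet)

def nat_loop(iterations=3, archive_size=8, top_k=4):
    # Initialize the archive A by random sampling; in NAT the subnet
    # weights are inherited directly from the supernet.
    archive = [random_subnet() for _ in range(archive_size)]
    for _ in range(iterations):
        # 1) evolutionary search proposes promising subnets -> added to A
        promising = sorted((random_subnet() for _ in range(16)),
                           key=surrogate_score, reverse=True)[:2]
        archive.extend(promising)
        # 2) in NAT the supernet weights shared with the top-ranked members
        #    of A are fine-tuned; here we only keep the top-ranked members.
        archive = sorted(archive, key=surrogate_score, reverse=True)[:archive_size]
    return archive[:top_k]

best = nat_loop()
```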


3. Search Space

Each network consists of five stages; each stage has two to four layers, and each layer is a residual block.


The searchable choices include:

  • the input image resolution (R),
  • the width multiplier (W),
  • the number of layers in each stage,
  • the expansion ratio (E) of the output channels of the first 1 × 1 convolution,
  • the kernel size (K) of the depthwise separable convolution in each layer.

The size of the search space is $ 3.5 × 10^{19} $.

4. Network Encoding

To encode these architectural choices, an integer string of length 22 is used.


The first two values represent

  • the input image resolution
  • width multiplier

The remaining 20 values denote

  • the expansion ratio, candidate values: [3, 4, 6]
  • the kernel-size settings for each of the 20 layers, candidate values: [3, 5, 7]
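A possible decoding of the 22-integer string. The resolution and width-multiplier candidate lists, and the joint per-layer (E, K) index, are my assumptions for illustration; only the [3, 4, 6] and [3, 5, 7] candidates come from the notes:

```python
# Illustrative candidate tables; only EXPANSION and KERNEL come from
# the notes — RESOLUTIONS and WIDTH_MULTS are assumed for the example.
RESOLUTIONS = [192, 224, 256]
WIDTH_MULTS = [1.0, 1.2]
EXPANSION = [3, 4, 6]   # expansion ratio E candidates
KERNEL = [3, 5, 7]      # kernel size K candidates

def decode(string):
    """Decode a length-22 integer string into architecture choices.
    Assumes each of the last 20 values jointly indexes one layer's
    (E, K) pair as v = e_index * 3 + k_index."""
    assert len(string) == 22
    arch = {"resolution": RESOLUTIONS[string[0]],
            "width_mult": WIDTH_MULTS[string[1]],
            "layers": []}
    for v in string[2:]:
        e, k = divmod(v, 3)
        arch["layers"].append({"expansion": EXPANSION[e], "kernel": KERNEL[k]})
    return arch

example = decode([1, 0] + [4] * 20)
```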

5. Accuracy Predictor

Shortcomings of previous prediction approaches:

  1. PNAS [21] uses 1,160 subnets to build the surrogate but only achieves a rank-order correlation of 0.476.
  2. Once-For-All [28] uses 16,000 subnets to build the surrogate.

Reason:

  • The poor sample complexity and rank-order correlation of these approaches is due to the offline learning of the surrogate model. Instead of focusing on models at the trade-off front of the objectives, these surrogate models are built for the entire search space.

Solution:

  • We overcome the aforementioned limitation by restricting the surrogate model to the search space that constitutes the current objective trade-off.
  • Such a solution significantly reduces the sample complexity of the surrogate and increases the reliability of its predictions.

Proposed method: an RBF ensemble

  1. predict the performance of a sampled subnet without performing training or inference.
  2. decouples the evaluation of an architecture from data
  3. the evaluation time reduces from hours/minutes to seconds.


  • To further improve the RBF's performance, especially in a high sample-efficiency regime, an ensemble of RBF models is constructed. As outlined in Algorithm 2, each RBF model is constructed with a subset of samples and features randomly selected from the training instances.
  • The RBF ensemble can be learned in under a minute.
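A minimal NumPy sketch of such an ensemble: each member model sees only a random subset of samples and features, and predictions are averaged. The Gaussian kernel, subset fractions, and kernel width are illustrative choices, not the paper's Algorithm 2 settings:

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_rbf(X, y, eps=1.0):
    # Gaussian RBF interpolant: solve (K + lambda*I) w = y for the weights.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    K = np.exp(-(eps * d) ** 2) + 1e-8 * np.eye(len(X))
    return X, np.linalg.solve(K, y)

def predict_rbf(model, Xq, eps=1.0):
    centers, w = model
    d = np.linalg.norm(Xq[:, None, :] - centers[None, :, :], axis=-1)
    return np.exp(-(eps * d) ** 2) @ w

def fit_ensemble(X, y, n_models=10, sample_frac=0.8, feat_frac=0.8):
    # Each member is trained on a random subset of samples AND features.
    n, p = X.shape
    models = []
    for _ in range(n_models):
        rows = rng.choice(n, size=max(2, int(sample_frac * n)), replace=False)
        cols = rng.choice(p, size=max(1, int(feat_frac * p)), replace=False)
        models.append((cols, fit_rbf(X[np.ix_(rows, cols)], y[rows])))
    return models

def predict_ensemble(models, Xq):
    # Average the member predictions.
    return np.mean([predict_rbf(m, Xq[:, cols]) for cols, m in models], axis=0)

# Toy data: "architectures" as 5-dim feature vectors, score = feature sum.
X = rng.random((30, 5))
y = X.sum(axis=1)
models = fit_ensemble(X, y)
pred = predict_ensemble(models, X[:5])
```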

6. Evolutionary Search

  • An evolutionary algorithm is an iterative process: an initially random set of sampled architectures is gradually improved as a group, referred to as a population.
  • Each batch of offspring networks is produced by applying mutation and crossover to promising parents.
  • Every member (parent or offspring) competes for survival and reproduction.

There are two main operators:

Crossover: exchanges information between two or more population members to create two or more new members.

  • The paper selects parents for mating via tournament selection.
  • Each crossover produces two offspring architectures; every generation produces an offspring population the same size as the parent population.

Mutation: a local operator that perturbs a solution to produce a new solution in its neighborhood.

  • A discrete version of the polynomial mutation (PM) operator [58] is used, applied to each offspring created by the crossover operator.


The overall evolutionary search flow is shown as a flowchart in the original paper.
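The two operators can be illustrated on integer strings. Uniform crossover and a simple random-reset mutation are simplified stand-ins here, not the paper's actual crossover or the discrete polynomial mutation [58]:

```python
import random

random.seed(1)

def uniform_crossover(p1, p2):
    # Exchange information position-by-position to create two offspring.
    c1, c2 = [], []
    for a, b in zip(p1, p2):
        if random.random() < 0.5:
            a, b = b, a
        c1.append(a)
        c2.append(b)
    return c1, c2

def discrete_mutation(child, choices=(0, 1, 2), prob=0.1):
    # Local perturbation: each position is resampled with small probability.
    return [random.choice(choices) if random.random() < prob else v
            for v in child]

parents = ([0] * 22, [2] * 22)
c1, c2 = uniform_crossover(*parents)
c1 = discrete_mutation(c1)
```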

7. Selection

The paper adopts the reference-point-guided selection originally proposed in NSGA-III.

(I did not fully follow the details here; the original NSGA-III paper is worth reading when time permits.)

8. Supernet Adaptation

When training the supernet, only the subnets suggested by the evolutionary search are trained, rather than training all subnets or giving every subnet an equal chance of being trained.

The authors' reasoning:

  • Firstly, not all subnets are equally important for the task at hand.
  • Secondly, only a tiny fraction of the search space can practically be explored by a NAS algorithm.

Specifically, NAT seeks to exploit the knowledge gained from the search process so far.

Procedure:

  1. We construct a categorical distribution from the architectures in the archive, where the probability of the i-th integer taking the value j is computed as:

    $$ p_i(j) = \frac{1}{|\mathcal{A}|} \sum_{a \in \mathcal{A}} \mathbb{1}(a_i = j) $$

    i.e., the empirical frequency with which value j occurs at position i across the archive $\mathcal{A}$.

  2. In each training step (batch of data) we sample an integer-string from the above distribution.

  3. We then activate the subparts of the supernet corresponding to the architecture decoded from the integer string.

  4. Only weights corresponding to the activated subparts of the supernet are updated in each step.
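Steps 1–2 can be sketched as follows: a per-position frequency distribution built from the archive, sampled once per training step. The supernet weight update itself is omitted:

```python
import random
from collections import Counter

random.seed(2)

def archive_distribution(archive):
    # For each position i, p_i(j) = frequency of value j across the archive.
    dist = []
    for i in range(len(archive[0])):
        counts = Counter(a[i] for a in archive)
        total = sum(counts.values())
        dist.append({j: c / total for j, c in counts.items()})
    return dist

def sample_architecture(dist):
    # Draw one integer string, one position at a time.
    return [random.choices(list(p), weights=list(p.values()))[0] for p in dist]

# Toy archive of three length-3 "architectures".
archive = [[0, 1, 2], [0, 2, 2], [0, 1, 1]]
dist = archive_distribution(archive)
arch = sample_architecture(dist)
```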
