（转）The Evolved Transformer - Enhancing Transformer with Neural Architecture Search

2019-03-26 19:14:33

Paper："The Evolved Transformer." So, David R., Chen Liang, and Quoc V. Le. arXiv preprint arXiv:1901.11117 (2019).

Code: https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/models/evolved_transformer.py

This blog is copied from: https://towardsdatascience.com/the-evolved-transformer-enhancing-transformer-with-neural-architecture-search-f0073a915aca

Related blog of Transformer: https://jalammar.github.io/illustrated-transformer/

Neural architecture search (NAS) is the process of algorithmically searching for new designs of neural networks. Though researchers have developed sophisticated architectures over the years, the ability to find the most efficient ones is limited, and recently NAS has reached the point where it can outperform human-designed models.

paper by Google Brain presents the first NAS to improve Transformer, one of the leading architecture for many Natural Language Processing tasks. The paper uses an evolution-based algorithm, with a novel approach to speed up the search process, to mutate the Transformer architecture to discover a better one — The Evolved Transformer (ET). The new architecture performs better than the original Transformer, especially when comparing small mobile-friendly models, and requires less training time. The concepts presented in the paper, such as the use of NAS to evolve human-designed models, has the potential to help researchers improve their architectures in many other areas.

Background

here.

Real et al. (see pseudo-code in Appendix A):

The first pool of models is initialized by randomly sampling the search space or by using a known model as a seed.
These models are trained for the given task and randomly sampled to create subpopulation.
The best models are mutated by randomly changing a small part of their architecture, such as replacing a layer or changing the connection between two layers.
The mutated models (child models) are added to the pool while the weakest model from the subpopulation is removed from the pool.

NASNet search architecture approaches this challenge by defining “stackable cells”. A cell can contain a set of operations on its input (e.g. convolution) from a predefined vocabulary and the model is built by stacking the same cell architecture several times. The goal of the search algorithm is only to find the best architecture of a cell.

An example of the NASNet search architecture for image classification task that contains two types of stackable cells (Normal and Reduction Cell). Source: Zoph et al.

How Evolved Transformer (ET) works

As the Transformer architecture has proven itself numerous times, the goal of the authors was to use a search algorithm to evolve it into an even better model. As a result, the model frame and the search space were designed to fit the original Transformer architecture in the following way:

The algorithm searches for two types of cells — one for the encoder with six copies (blocks) and another for the decoder with eight copies.
Transformer), etc, and activation can be ReLU and Leaky ReLU. Some elements can also be an identity operation or a dead-end.
Each cell can be repeated up to six times.

paper.

ET Stackable Cell format. Source: ET

Progressive Dynamic Hurdles (PDH)

Searching the entire space might take too long if the training and evaluation of each model are prolonged. It’s possible to overcome this problem in the field of image classification by performing the search on a proxy task, such as training a smaller dataset (e.g. CIFAR-10) before testing on a bigger dataset such as ImageNet. However, the authors couldn’t find an equivalent solution for translation models and therefore introduced an upgraded version of the tournament selection algorithm.

Instead of training each model in the pool on the entire dataset, a process that takes ~10 hours on a single TPU, the training is done gradually and only for the best models in the pool. The models in the pool are trained on a given amount of samples and more models are created according to the original tournament selection algorithm. Once there are enough models in the pool, a “fitness” threshold is calculated and only the models with better results (fitness) continue to the next step. These models will be trained on another batch of samples and the next models will be created and mutated based on them. As a result, PDH significantly reduces the training time spent on failing models and increases search efficiency. The downside is that “slow starters”, models that need more samples to achieve good results, might be missed.

An example of the tournament selection training process. The models above the fitness threshold are trained on more sample and therefore reach better fitness. The fitness threshold increases in steps as new models are created. Source: ET

perplexity metric, the lower the better) of different search techniques — Transformer vs. random initialization and PDH vs. regular tournament selection (with a given number of training steps per model).

Comparison of different search techniques. Source: ET

The authors kept the total training time of each technique fixed and therefore the number of models differs: more training steps per model -> fewer total number of models can be searched and vice-versa. The PDH technique achieves the best results on average while being more stable (low variance). When reducing the number of training steps (30K), the regular technique performs almost as good on average as PDH. However, it suffers from a higher variance as it’s more prone to mistakes in the search process.

Results

WMT ’18. The search algorithm ran on 15,000 models using 270 TPUs for a total of almost 1 billion steps, while without PDH the total required steps would have been 3.6 billion. The best model found was named The Evolved Transformer (ET) and achieved better results compared to the original Transformer (perplexity of 3.95 vs 4.05) and required less training time. Its encoder and decoder block architectures are shown in the following chart (compared to the original ones).

Transformer and ET encoder and decoder architectures. Source: ET

depth-wise separable convolutions, which is more parameter efficient but less powerful compared to normal convolution. Another interesting example is the use of parallel branches (e.g. two convolution and RELU layers for the same input) in both the decoder and the encoder. The authors also discovered in an ablation study that the superior performance cannot be attributed to any single mutation of the ET compared to the Transformer.

Both ET and Transformer are heavy models with over 200 million parameters. Their size can be reduced by changing the input embedding (i.e. a word vector) size and the rest of the layers accordingly. Interestingly, the smaller the model the bigger the advantage of ET over Transformer. For example, for the smallest model with only 7 million parameters, ET outperforms Transformer by 1 perplexity point (7.62 vs 8.62).

Comparison of Transformer and ET for different model sizes (according to embedding size). FLOPS represents the training duration of the model. Source: ET

Implementation details

As mentioned, the search algorithm required used over 200 Google’s TPUs in order to train thousands of models in a reasonable time. The training of the final ET model itself is faster than the original Transformer but still takes hours with a single TPU on the WMT’14 En-De dataset.

here.

Conclusion

Evolved Transformer shows the potential of combining hand-crafted with neural search algorithms to create architectures that are consistently better and faster to train. As computing resources are still limited (even for Google), researchers still need to carefully design the search space and improve the search algorithms to outperform human-designed models. However, this trend will undoubtedly just grow stronger over time.

LyrnAI

Appendix A - Tournament Selection Algorithm

Real et al.except for the aging process of discarding the oldest models from the population: