The Rise of Meta Learning

2019-10-18 06:48:37

This blog is from: https://towardsdatascience.com/the-rise-of-meta-learning-9c61ffac8564

Oct 16 · 9 min read

https://openai.com/blog/solving-rubiks-cube/

of Deep Learning and Artificial Intelligence.

evolution from block orientation to solving a rubik’s cube is fueled by a Meta-Learning algorithm controlling the training data distribution in simulation, Automatic Domain Randomization (ADR).

Domain Randomization — Data Augmentation

adversarial noise injections, Deep Convolutional Neural Networks will not generalize when trained on images in simulation (displayed below on the left) to real visual data (shown below on the right) without special modifications.

“Solving Rubik’s Cube with a Robot Hand” by Ilge Akkaya, Marcin Andrychowicz, Maciek Chociej, Mateusz Litwin, Bob McGrew, Arthur Petron, Alex Paino, Matthias Plappert, Glenn Powell, Raphael Ribas, Jonas Schneider, Nikolas Tezak, Jerry Tworek, Peter Welinder, Lilian Weng, Qiming Yuan, Wojciech Zaremba, Lei Zhang

Generative Adversarial Network to make simulated images appear as realistic as possible, criticized by a discriminator classifying images as belonging to the real or simulated dataset. The research reports a positive result on eye gaze estimation and hand pose estimation. The other approach is to make the simulated data as diverse as possible, contrarily to as realistic as possible.

Domain Randomization. This idea is well illustrated in the image below from Tobin et al.’s paper in 2017:

“Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World” by Josh Tobin, Rachel Fong, Alex Ray, Jonas Schneider, Wojciech Zaremba, Pieter Abbeel

Domain Randomization appears to be the key to bridging the Sim2Real gap, allowing Deep Neural Networks to generalize to real data when trained on simulation. Not unlike most algorithms, Domain Randomization comes with many parameters to be tuned. The image below shows randomizations in the colors of the blocks, the lighting of the environment, and the magnitude of shadows, to name a few. Each of these randomized environment features comes with a lower to upper bound interval and some kind of sampling distribution. For example, when sampling a randomized environment, what is the probability this environment has very bright lighting?

See Appendeix B of Solving Rubik’s Cube with a Robot Hand for more details).

automated rather than manually designed, clearly defined in the following lines of the ADR algorithm:

Image from “Solving Rubik’s Cube with a Robot Hand”. If the agent’s performance exceeds a parametric performance threshold, the intensity of the randomization is increased (given by delta with phi defining the distribution of the environment parameters)

AI that Designs its Own Data

Paired Open-Ended Trailblazer (POET) algorithm developed by researchers at Uber AI Labs.

“Paired Open-Ended Trailblazer (POET): Endlessly Generating Increasingly Complex and Diverse Learning Environments and Their Solutions” by Rui Wang, Joel Lehman, Jeff Clune, Kenneth O. Stanley

POET trains a bipedal walking agent by simultaneously optimizing the agent and the environment in which it learns to walk in. POET is different from OpenAI’s rubik’s cube solver in that it uses an evolutionary algorithm, maintaining a population of walkers and environments. The structure of having populations of agents and environments is key to structuring the evolution of complexity in this research. Despite the use of Reinforcement Learning to train a single agent compared to Population-based Learning to adapt a group of agents, POET and Automatic Domain Randomization are very similar. They both develop a progression of increasingly challenging training datasets in an automated way. The Bipedal’s walking environment does not change as a manually encoded function, but rather as a result of the population of walkers performances in the different environments, signaling when it is time to crank up the challenge of the terrain.

Data or Models?

Data Augmentation.

Data Augmentation is most easily understood in the context of image data, although we have already seen how the physics data can be augmented and randomized as well. These image augmentations typically include horizontal flips and small magnitudes of rotations or translations. This kind of augmentation is typical in any computer vision pipeline such as image classification, object detection, or super resolution.

Curriculum Learning is another data-level optimization concerned with the order in which data is presented to learning models. For example, starting a student off with easy examples such as 2 + 2 = 4, before introducing more difficult ideas such as 2³ = 8. Meta-Learning controllers of Curriculum Learning look at how data is ranked according to some metric of perceived difficulty and the order in which this data should be presented in. A recent study from Hacohen and Weinshall presents interesting success with this (shown below) in the ICML 2019 conference.

“On the Power of Curriculum Learning in Training Deep Networks” by Guy Hacohen and Daphan Weinshall. Vanilla SGD data selection shown on the gray bar on the far left is outperformed by curriculum learning methods

paperswithcode.com.

Meta-learning neural architectures try to describe a space of possible architectures and then search for the best architecture according to one or multiple objective metrics.

Advanced Meta Learners

AutoAugment from Google.

How Expressive is Meta Learning?

One of the limitations of Meta-Learning frequently addressed in Neural Architecture Search is the constraint of the search space. Neural Architecture Searches begin from a manually designed encoding of possible architectures. This manual encoding naturally limits the discoveries possible by the search. However, there is a trade-off necessary to make the search computable at all.

Current architecture searches view neural architectures as Directed Acyclic Graphs (DAGs) and try to optimize the connections between nodes. Papers such as “Weight Agnostic Neural Networks” by Gaier and Ha and “Exploring Randomly Wired Neural Networks for Image Recognition” by Xie et al. show that constructing DAG neural architectures is complex and not well understood.

Batch Normalization.

Gary Marcus?

it doesn’t seem like the current state-of-the-art generative models such as BigGAN or VQ-VAE-2 works for data augmentation on ImageNet classification.

Transfer and Meta Learning

identifying steel defects.

An interesting result of the Rubik’s Cube Solver is the ability to adapt to perturbations. For example, the solver is able to continue despite putting a rubber glove on the hand, tying fingers together, and blanket occlusion of the cube, (the vision model must be completely impaired, thus the sensing has to be done by the Giiker cube’s sensors). This kind of Transfer Meta-Learning is a result of the LSTM layers in the policy network used to train the robotic hand control. I think this use of “Meta-Learning” is more of a characteristic of Memory Augmented Networks compared to AutoML optimization. I think this demonstrates the difficulty of unifying Meta-Learning and settling on a single definition for the term.

Conclusion

escribed in Jeff Clune’s AI-GAs, of algorithms that contain meta-learning architectures, meta-learning the learning algorithms themselves, and generating effective learning environments stand to be an enormous opportunity for the advancement of Deep Learning and Artificial Intelligence. Thank you for reading!