2020-03-02 13:39:12
Source: https://towardsdatascience.com/illustrating-the-reformer-393575ac6ba0
and you probably know it's different from the one that Twitter thinks:
The Reformer model is expected to have a significant impact on the field by going beyond language applications (e.g., music, speech, image, and video generation).
In this post, we'll dive into the Reformer model and try to understand it with some visual guides. Ready?
Why Transformer?
Recurrent architectures such as LSTMs powered earlier sequence models like Google's Neural Machine Translation System. However, the inherently sequential nature of recurrence was the biggest obstacle to parallelizing computation over the sequence (and a source of speed and vanishing-gradient problems), and as a result those architectures could not take advantage of context over long sequences.
What’s missing from the Transformer?
Jay Alammar's The Illustrated Transformer post is the greatest visual explanation of the Transformer so far, and I highly encourage reading it before going through the rest of this post.
Although Transformer models yield great results on increasingly long sequences (e.g., 11K-token text examples in Liu et al., 2018), many such large models can only be trained on large industrial compute platforms, and cannot even be fine-tuned on a single GPU for a single training step due to their memory requirements. For example, the full GPT-2 model consists of roughly 1.5B parameters. The number of parameters in the largest configuration reported in (Shazeer et al., 2018) exceeds 0.5B per layer, while the number of layers goes up to 64 in (Al-Rfou et al., 2018).
Let's look at a simplified overview of the Transformer model:
(Figure: a simplified overview of the Transformer model, adapted from The Illustrated Transformer post.)
You may notice three colored markers in the diagram. Each of them highlights a part of the Transformer model that the Reformer authors identified as a source of computation and memory issues:
Problem 1 (red): Attention computation
Computing attention over a sequence of length L costs O(L²) in both time and memory. Imagine what happens if we have a sequence of length 64K.
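To see why this matters, here is a quick back-of-the-envelope calculation (my own, not from the paper) of the memory needed just to store the L × L attention score matrix in float32:

```python
# Back-of-the-envelope: bytes needed for the L x L attention score
# matrix alone, in float32, for a single head and a single example.
def attention_matrix_bytes(seq_len, bytes_per_float=4):
    return seq_len * seq_len * bytes_per_float

L = 64 * 1024  # a 64K-token sequence
print(f"{attention_matrix_bytes(L) / 2**30:.0f} GiB")  # prints "16 GiB"
```

That is 16 GiB for a single attention matrix, before counting batch size, the number of heads, or any other activations.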
Problem 2 (black): Large number of layers
A model with N layers consumes N times more memory than a single-layer model, as the activations in each layer need to be stored for back-propagation.
Problem 3 (green): Depth of feed-forward layers
The dimensionality of the intermediate feed-forward layers is often much larger than that of the attention activations.
The Reformer addresses these three problems so that long sequences can be handled on a single accelerator with only 16GB of memory. In particular, it introduces locality-sensitive hashing (LSH) attention to cut the cost of computing attention, and reversible residual layers to more efficiently use the memory available.
Below we go into further details.
1. Locality-sensitive hashing (LSH) attention
Attention and nearest neighbors
Attention in deep learning is a mechanism that enables the network to focus on different parts of the context based on their relevance to the current timestep. There are three types of attention in the Transformer model:
The standard attention used in the Transformer is the scaled dot-product attention, formulated as:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V

Computing and storing the term QKᵀ costs O(L²) for a sequence of length L, which is the main memory bottleneck.
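As a concrete reference point, here is a minimal NumPy sketch of scaled dot-product attention (single head, no masking or batching):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                # (L, L): the O(L^2) term
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

L, d = 8, 4
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, L, d))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (8, 4)
```

The intermediate `scores` array is what blows up quadratically with sequence length.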
Since softmax is dominated by its largest elements, for each query qᵢ we only need to focus on the keys that are closest to qᵢ. So can we replace the full attention computation with a nearest neighbors search?
This would reduce the complexity of attention from O(L²) to O(L log L).
LSH for nearest neighbors search
Locality-sensitive hashing is a family of hash functions in which nearby points receive the same hash with high probability, i.e., for two close points p and q, hash(q) == hash(p).
A simple LSH scheme projects each point onto a set of random hyperplanes H and uses sign(pᵀH) as the hash code of each point. Let's look at an example below:
In the example, the nearby points a and b fall on the same side of every hyperplane, so hash(a) == hash(b). Now the search space for finding the nearest neighbors of each point reduces dramatically from the whole dataset to the bucket it belongs to.
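A minimal sketch of this sign-based random-projection hashing; the dimensions, seeds, and helper name are my own illustrative choices:

```python
import numpy as np

def lsh_hash(points, n_planes=4, dim=8, seed=0):
    """sign(p^T H) for random hyperplane normals H, packed into an
    integer bucket id in [0, 2**n_planes)."""
    rng = np.random.default_rng(seed)
    H = rng.normal(size=(dim, n_planes))
    bits = (points @ H) > 0                     # which side of each plane?
    return bits @ (1 << np.arange(n_planes))    # pack sign bits into an int

rng = np.random.default_rng(1)
a = rng.normal(size=8)
b = a + 0.01 * rng.normal(size=8)   # a near-duplicate of a
codes = lsh_hash(np.stack([a, b]))
# Nearby points usually share a bucket; identical points always do.
```

With 4 hyperplanes, all points fall into one of 2⁴ = 16 buckets, and only points sharing a bucket need to be compared.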
The variant used in the Reformer, called angular LSH, projects the points onto a unit sphere that has been divided into predefined regions, each with a distinct code. A series of random rotations of the points then defines the bucket each point belongs to. Let's illustrate this through a simplified 2D example, taken from the Reformer paper:
Here we have two points that are projected onto a unit circle and rotated randomly 3 times with different angles. We can observe that they are unlikely to share the same hash bucket. In the next example, however, we see that two points that are pretty close to each other will end up sharing the same hash bucket after 3 random rotations:
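The rotation-based scheme can be sketched following the paper's formulation, where the bucket id is the argmax over the concatenation [xR; −xR] of a random projection and its negation (a simplified, non-batched version):

```python
import numpy as np

def angular_lsh(x, n_buckets=8, seed=0):
    """Bucket id = argmax over [xR ; -xR] for a random projection R,
    as in the Reformer's angular LSH (n_buckets must be even)."""
    rng = np.random.default_rng(seed)
    R = rng.normal(size=(x.shape[-1], n_buckets // 2))
    xR = x @ R
    return np.argmax(np.concatenate([xR, -xR], axis=-1), axis=-1)

rng = np.random.default_rng(1)
p = rng.normal(size=4)
buckets = angular_lsh(np.stack([p, -p]))
# A point and its antipode -p always land in different (opposite) buckets.
```

Each row is assigned to exactly one of `n_buckets` buckets, and nearby directions on the sphere tend to share the same argmax.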
LSH attention
To compute LSH attention given the query and key matrices, we do the following:
- Find the LSH hash codes of the Q and K matrices.
- Compute standard attention only among the q and k vectors within the same hash buckets.
Multi-round LSH attention: repeat the above procedure a few times to reduce the probability that similar items fall into different buckets.
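Putting the pieces together, here is a toy version of LSH attention: a single hash round, shared query/key vectors, full attention within each bucket, and no chunking, sorting, or causal masking, so it is a deliberate simplification of the paper's actual scheme:

```python
import numpy as np

def lsh_attention(x, v, n_buckets=4, seed=0):
    """Toy LSH attention: hash the shared query/key vectors x with
    angular LSH, then run full softmax attention inside each bucket."""
    rng = np.random.default_rng(seed)
    R = rng.normal(size=(x.shape[-1], n_buckets // 2))
    xR = x @ R
    buckets = np.argmax(np.concatenate([xR, -xR], axis=-1), axis=-1)
    out = np.zeros_like(v)
    for b in np.unique(buckets):
        idx = np.where(buckets == b)[0]                    # positions in bucket b
        scores = x[idx] @ x[idx].T / np.sqrt(x.shape[-1])  # shared-QK attention
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        out[idx] = w @ v[idx]
    return out

rng = np.random.default_rng(2)
x = rng.normal(size=(16, 8))   # shared query/key vectors
v = rng.normal(size=(16, 8))   # values
out = lsh_attention(x, v)
print(out.shape)  # (16, 8)
```

Each position attends only within its own bucket, so the per-bucket attention matrices stay small even for long sequences.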
The animation below illustrates a simplified version of LSH Attention based on the figure from the paper.
2. Reversible Transformer and Chunking
The second and third problems are caused by the large number (N) of encoder and decoder layers and by the dimensionality of the feed-forward layers.
Reversible Residual Network (RevNet)
In a standard residual network, the activations of every layer need to be stored in memory in order to calculate gradients during backpropagation. The memory cost is proportional to the number of units in the network.
A RevNet avoids this by splitting the input into two parts (X₁, X₂) and making each layer reversible, so that its inputs can be recovered exactly from its outputs (Y₁, Y₂).
Reversible Transformer
Applying this idea to the Transformer removes the need to store per-layer activations, so memory no longer grows with the number of layers N.
In the reversible Transformer, the function F becomes the attention layer and the function G becomes the feed-forward layer:
Y₁ = X₁ + Attention(X₂)
Y₂ = X₂ + FeedForward(Y₁)
This lets the reversible Transformer store activations only once instead of N times.
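These two equations, and the exact inversion they allow, can be checked numerically; the sublayers below are simple placeholder functions standing in for real attention and feed-forward layers:

```python
import numpy as np

# Placeholder sublayers (stand-ins for real Attention and FeedForward).
def attention(x):    return np.tanh(x)
def feed_forward(x): return 0.5 * x

def rev_forward(x1, x2):
    y1 = x1 + attention(x2)
    y2 = x2 + feed_forward(y1)
    return y1, y2

def rev_inverse(y1, y2):
    # Reconstruct the inputs from the outputs: nothing needs storing.
    x2 = y2 - feed_forward(y1)
    x1 = y1 - attention(x2)
    return x1, x2

x1, x2 = np.random.default_rng(0).normal(size=(2, 4))
r1, r2 = rev_inverse(*rev_forward(x1, x2))
print(np.allclose(x1, r1) and np.allclose(x2, r2))  # True
```

During backpropagation, each layer's inputs are recomputed from its outputs in this way, instead of being kept in memory for all N layers.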
Chunking
The last portion of efficiency improvements in the Reformer deals with the third problem: the high-dimensional intermediate vectors of the feed-forward layers, which can go up to 4K dimensions and higher.
Because computations in feed-forward layers are independent across positions in a sequence, the computations for the forward and backward passes, as well as the reverse computation, can all be split into chunks. For example, for the forward pass we will have:
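A small sketch showing that chunking the feed-forward computation over positions gives exactly the same result as running it in one shot (the weight shapes here are arbitrary illustrative choices):

```python
import numpy as np

def feed_forward(x, W1, W2):
    return np.maximum(x @ W1, 0) @ W2   # position-wise FFN with ReLU

def chunked_feed_forward(x, W1, W2, n_chunks=4):
    # Positions are independent, so process the sequence chunk by chunk:
    # peak memory holds one chunk's wide activation, not the whole L x d_ff.
    return np.concatenate(
        [feed_forward(c, W1, W2) for c in np.array_split(x, n_chunks)])

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 8))
W1, W2 = rng.normal(size=(8, 32)), rng.normal(size=(32, 8))
print(np.allclose(feed_forward(x, W1, W2),
                  chunked_feed_forward(x, W1, W2)))  # True
```

Only one chunk's d_ff-wide intermediate activation is alive at a time, trading a little speed for a large memory saving.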
Experimental Results
The authors ran experiments on enwik8 (with sequences of length 64K) and evaluated the effect of the reversible Transformer and of LSH hashing on memory, accuracy, and speed.
Reversible Transformer matches baseline: their experimental results showed that the reversible Transformer saves memory without sacrificing accuracy:
LSH attention matches baseline: note that since LSH attention is an approximation of full attention, its accuracy improves as the number of hash rounds increases. With 8 hash rounds, LSH attention is almost equivalent to full attention:
They also demonstrated that conventional attention slows down as the sequence length increases, while LSH attention speed remains steady, running on sequences of length ~100K at usual speed on 8GB GPUs:
The final Reformer model performed comparably to the Transformer model, but showed much higher memory efficiency and faster speed on long sequences.
Trax: Code and examples
The Reformer code has been released as part of the Trax library, along with examples for image and text generation tasks.
Acknowledgment
Thanks to Abraham Kang for his deep review and constructive feedback.
References and related links:
- Reformer: The efficient Transformer
- Google/Trax deep learning library
- The Illustrated Transformer
- Huggingface/Transformers NLP library
- Attention is all you need
- OpenAI's GPT-2 language model
- Write With Transformer
- Talk to Transformer
- Google’s Neural Machine Translation System