2020-03-02 13:39:12
Source: https://towardsdatascience.com/illustrating-the-reformer-393575ac6ba0
and you probably know it's different from the one that Twitter thinks:
The Reformer model is expected to have a significant impact on the field by going beyond language applications (e.g., music, speech, image, and video generation).
In this post, we'll dive into the Reformer model and try to understand it with some visual guides. Ready?
Why Transformer?
Recurrent architectures such as LSTMs powered earlier sequence models like Google's Neural Machine Translation System. However, the inherently sequential nature of recurrence was the biggest obstacle to parallelizing computation over the sequence (and a source of speed and vanishing-gradient problems), and as a result those architectures could not take advantage of context over long sequences.
What’s missing from the Transformer?
Jay Alammar's The Illustrated Transformer post is the greatest visual explanation of the Transformer so far, and I highly encourage reading it before going through the rest of this post.
Although Transformer models yield great results on increasingly long sequences (e.g., 11K-token text examples in Liu et al., 2018), many such large models can only be trained on large industrial compute platforms, and cannot even be fine-tuned on a single GPU for a single training step due to their memory requirements. For example, the full GPT-2 model consists of roughly 1.5B parameters. The number of parameters in the largest configuration reported in (Shazeer et al., 2018) exceeds 0.5B per layer, while the number of layers goes up to 64 in (Al-Rfou et al., 2018).
Let's look at a simplified overview of the Transformer model:
(Figure: a simplified overview of the Transformer model, adapted from The Illustrated Transformer post.)
You may notice three colored markers in the diagram. Each of them highlights a part of the Transformer model that the Reformer authors identified as a source of computation and memory issues:
Problem 1 (red): Attention computation
Computing attention over a sequence of length L costs O(L²) in both time and memory. Imagine what happens if we have a sequence of length 64K.
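To see why this matters, here is a quick back-of-the-envelope calculation (my own, not from the paper) of the memory needed just to store the L × L attention score matrix in float32:

```python
# Back-of-the-envelope: bytes needed for the L x L attention score
# matrix alone, in float32, for a single head and a single example.
def attention_matrix_bytes(seq_len, bytes_per_float=4):
    return seq_len * seq_len * bytes_per_float

L = 64 * 1024  # a 64K-token sequence
print(f"{attention_matrix_bytes(L) / 2**30:.0f} GiB")  # prints "16 GiB"
```

That is 16 GiB for a single attention matrix, before counting batch size, the number of heads, or any other activations.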
Problem 2 (black): Large number of layers
A model with N layers consumes N times more memory than a single-layer model, as the activations in each layer need to be stored for back-propagation.
Problem 3 (green): Depth of feed-forward layers
The dimensionality of the intermediate feed-forward layers is often much larger than that of the attention activations.
The Reformer addresses these three problems so that long sequences can be handled on a single accelerator with only 16GB of memory. In particular, it introduces locality-sensitive hashing (LSH) attention to cut the cost of computing attention, and reversible residual layers to more efficiently use the memory available.
Below we go into further details.
1. Locality-sensitive hashing (LSH) attention
Attention and nearest neighbors
Attention in deep learning is a mechanism that enables the network to focus on different parts of the context based on their relevance to the current timestep. There are three types of attention in the Transformer model:
The standard attention used in the Transformer is the scaled dot-product attention, formulated as:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V

Computing and storing the term QKᵀ costs O(L²) for a sequence of length L, which is the main memory bottleneck.
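As a concrete reference point, here is a minimal NumPy sketch of scaled dot-product attention (single head, no masking or batching):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                # (L, L): the O(L^2) term
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

L, d = 8, 4
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, L, d))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (8, 4)
```

The intermediate `scores` array is what blows up quadratically with sequence length.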
Since softmax is dominated by its largest elements, for each query qᵢ we only need to focus on the keys that are closest to qᵢ. So can we replace the full attention computation with a nearest neighbors search?
This would reduce the complexity of attention from O(L²) to O(L log L).
LSH for nearest neighbors search
Locality-sensitive hashing is a family of hash functions in which nearby points receive the same hash with high probability, i.e., for two close points p and q, hash(q) == hash(p).
A simple LSH scheme projects each point onto a set of random hyperplanes H and uses sign(pᵀH) as the hash code of each point. Let's look at an example below:
In the example, the nearby points a and b fall on the same side of every hyperplane, so hash(a) == hash(b). Now the search space for finding the nearest neighbors of each point reduces dramatically from the whole dataset to the bucket it belongs to.
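A minimal sketch of this sign-based random-projection hashing; the dimensions, seeds, and helper name are my own illustrative choices:

```python
import numpy as np

def lsh_hash(points, n_planes=4, dim=8, seed=0):
    """sign(p^T H) for random hyperplane normals H, packed into an
    integer bucket id in [0, 2**n_planes)."""
    rng = np.random.default_rng(seed)
    H = rng.normal(size=(dim, n_planes))
    bits = (points @ H) > 0                     # which side of each plane?
    return bits @ (1 << np.arange(n_planes))    # pack sign bits into an int

rng = np.random.default_rng(1)
a = rng.normal(size=8)
b = a + 0.01 * rng.normal(size=8)   # a near-duplicate of a
codes = lsh_hash(np.stack([a, b]))
# Nearby points usually share a bucket; identical points always do.
```

With 4 hyperplanes, all points fall into one of 2⁴ = 16 buckets, and only points sharing a bucket need to be compared.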
The variant used in the Reformer, called angular LSH, projects the points onto a unit sphere that has been divided into predefined regions, each with a distinct code. A series of random rotations of the points then defines the bucket each point belongs to. Let's illustrate this through a simplified 2D example, taken from the Reformer paper:
Here we have two points that are projected onto a unit circle and rotated randomly 3 times with different angles. We can observe that they are unlikely to share the same hash bucket. In the next example, however, we see that two points that are pretty close to each other will end up sharing the same hash bucket after 3 random rotations:
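The rotation-based scheme can be sketched following the paper's formulation, where the bucket id is the argmax over the concatenation [xR; −xR] of a random projection and its negation (a simplified, non-batched version):

```python
import numpy as np

def angular_lsh(x, n_buckets=8, seed=0):
    """Bucket id = argmax over [xR ; -xR] for a random projection R,
    as in the Reformer's angular LSH (n_buckets must be even)."""
    rng = np.random.default_rng(seed)
    R = rng.normal(size=(x.shape[-1], n_buckets // 2))
    xR = x @ R
    return np.argmax(np.concatenate([xR, -xR], axis=-1), axis=-1)

rng = np.random.default_rng(1)
p = rng.normal(size=4)
buckets = angular_lsh(np.stack([p, -p]))
# A point and its antipode -p always land in different (opposite) buckets.
```

Each row is assigned to exactly one of `n_buckets` buckets, and nearby directions on the sphere tend to share the same argmax.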
LSH attention
To compute LSH attention given the query and key matrices, we do the following:
- Find the LSH hash codes of the Q and K matrices.
- Compute standard attention only among the q and k vectors within the same hash buckets.
Multi-round LSH attention: repeat the above procedure a few times to reduce the probability that similar items fall into different buckets.
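Putting the pieces together, here is a toy version of LSH attention: a single hash round, shared query/key vectors, full attention within each bucket, and no chunking, sorting, or causal masking, so it is a deliberate simplification of the paper's actual scheme:

```python
import numpy as np

def lsh_attention(x, v, n_buckets=4, seed=0):
    """Toy LSH attention: hash the shared query/key vectors x with
    angular LSH, then run full softmax attention inside each bucket."""
    rng = np.random.default_rng(seed)
    R = rng.normal(size=(x.shape[-1], n_buckets // 2))
    xR = x @ R
    buckets = np.argmax(np.concatenate([xR, -xR], axis=-1), axis=-1)
    out = np.zeros_like(v)
    for b in np.unique(buckets):
        idx = np.where(buckets == b)[0]                    # positions in bucket b
        scores = x[idx] @ x[idx].T / np.sqrt(x.shape[-1])  # shared-QK attention
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        out[idx] = w @ v[idx]
    return out

rng = np.random.default_rng(2)
x = rng.normal(size=(16, 8))   # shared query/key vectors
v = rng.normal(size=(16, 8))   # values
out = lsh_attention(x, v)
print(out.shape)  # (16, 8)
```

Each position attends only within its own bucket, so the per-bucket attention matrices stay small even for long sequences.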
The animation below illustrates a simplified version of LSH Attention based on the figure from the paper.
2. Reversible Transformer and Chunking
The second and third problems are caused by the large number (N) of encoder and decoder layers and by the dimensionality of the feed-forward layers.
Reversible Residual Network (RevNet)
In a standard residual network, the activations of every layer need to be stored in memory in order to calculate gradients during backpropagation. The memory cost is proportional to the number of units in the network.
A RevNet avoids this by splitting the input into two parts (X₁, X₂) and making each layer reversible, so that its inputs can be recovered exactly from its outputs (Y₁, Y₂).
Reversible Transformer
Applying this idea to the Transformer removes the need to store per-layer activations, so memory no longer grows with the number of layers N.
In the reversible Transformer, the function F becomes the attention layer and the function G becomes the feed-forward layer:
Y₁ = X₁ + Attention(X₂)
Y₂ = X₂ + FeedForward(Y₁)
This lets the reversible Transformer store activations only once instead of N times.
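These two equations, and the exact inversion they allow, can be checked numerically; the sublayers below are simple placeholder functions standing in for real attention and feed-forward layers:

```python
import numpy as np

# Placeholder sublayers (stand-ins for real Attention and FeedForward).
def attention(x):    return np.tanh(x)
def feed_forward(x): return 0.5 * x

def rev_forward(x1, x2):
    y1 = x1 + attention(x2)
    y2 = x2 + feed_forward(y1)
    return y1, y2

def rev_inverse(y1, y2):
    # Reconstruct the inputs from the outputs: nothing needs storing.
    x2 = y2 - feed_forward(y1)
    x1 = y1 - attention(x2)
    return x1, x2

x1, x2 = np.random.default_rng(0).normal(size=(2, 4))
r1, r2 = rev_inverse(*rev_forward(x1, x2))
print(np.allclose(x1, r1) and np.allclose(x2, r2))  # True
```

During backpropagation, each layer's inputs are recomputed from its outputs in this way, instead of being kept in memory for all N layers.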
Chunking
The last portion of efficiency improvements in the Reformer deals with the third problem: the high-dimensional intermediate vectors of the feed-forward layers, which can go up to 4K dimensions and higher.
Because computations in feed-forward layers are independent across positions in a sequence, the computations for the forward and backward passes, as well as the reverse computation, can all be split into chunks. For example, for the forward pass we will have:
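A small sketch showing that chunking the feed-forward computation over positions gives exactly the same result as running it in one shot (the weight shapes here are arbitrary illustrative choices):

```python
import numpy as np

def feed_forward(x, W1, W2):
    return np.maximum(x @ W1, 0) @ W2   # position-wise FFN with ReLU

def chunked_feed_forward(x, W1, W2, n_chunks=4):
    # Positions are independent, so process the sequence chunk by chunk:
    # peak memory holds one chunk's wide activation, not the whole L x d_ff.
    return np.concatenate(
        [feed_forward(c, W1, W2) for c in np.array_split(x, n_chunks)])

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 8))
W1, W2 = rng.normal(size=(8, 32)), rng.normal(size=(32, 8))
print(np.allclose(feed_forward(x, W1, W2),
                  chunked_feed_forward(x, W1, W2)))  # True
```

Only one chunk's d_ff-wide intermediate activation is alive at a time, trading a little speed for a large memory saving.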
Experimental Results
The authors ran experiments on enwik8 (with sequences of length 64K) and evaluated the effect of the reversible Transformer and of LSH hashing on memory, accuracy, and speed.
Reversible Transformer matches baseline: their experimental results showed that the reversible Transformer saves memory without sacrificing accuracy:
LSH attention matches baseline: note that since LSH attention is an approximation of full attention, its accuracy improves as the number of hash rounds increases. With 8 hash rounds, LSH attention is almost equivalent to full attention:
They also demonstrated that conventional attention slows down as the sequence length increases, while LSH attention speed remains steady, running on sequences of length ~100K at usual speed on 8GB GPUs:
The final Reformer model performed comparably to the Transformer model, but showed much higher memory efficiency and faster speed on long sequences.
Trax: Code and examples
The Reformer code has been released as part of the Trax library, along with examples for image and text generation tasks.
Acknowledgment
Thanks to Abraham Kang for his deep review and constructive feedback.
References and related links:
- Reformer: The efficient Transformer
- Google/Trax deep learning library
- The Illustrated Transformer
- Huggingface/Transformers NLP library
- Attention is all you need
- OpenAI's GPT-2 language model
- Write With Transformer
- Talk to Transformer
- Google’s Neural Machine Translation System