NLP Study Notes (5): attention / self-attention / seq2seq / transformer

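In the Encoder-Decoder architecture, the encoder compresses the entire input sequence into a single semantic feature vector c, which is then decoded. c must therefore contain all the information in the original sequence, and its fixed length becomes the bottleneck that limits model performance. In machine translation, for example, a single c may not be able to hold all the information in a long sentence, so translation accuracy drops.

The attention mechanism solves this problem by feeding the decoder a different c at each time step.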
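As a minimal sketch of this idea, assuming dot-product scoring between the current decoder state and each encoder hidden state (the function names and toy values are illustrative and do not come from the original post), a separate context vector can be computed for every decoding step:

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def context_vector(decoder_state, encoder_states):
    """Weighted sum of encoder states; the weights depend on the decoder state."""
    # Score each encoder state against the current decoder state (dot product).
    scores = [sum(d * h for d, h in zip(decoder_state, hs)) for hs in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    # c_t = sum_j weights[j] * h_j
    return [sum(w * hs[k] for w, hs in zip(weights, encoder_states)) for k in range(dim)]

# Hypothetical toy values: three encoder hidden states of dimension 2.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
c_step1 = context_vector([2.0, 0.0], encoder_states)  # decoder state at step 1
c_step2 = context_vector([0.0, 2.0], encoder_states)  # decoder state at step 2
print(c_step1, c_step2)
```

Because the weights are recomputed from the decoder state, the two steps get different context vectors from the same encoder states.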

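4. self-attention: its inputs and outputs are the same as an RNN's; only the computation in between differs. b1 through b4 are computed at the same time, whereas in an RNN b4 cannot be computed until b1, b2, and b3 have been computed in order.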


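II. Attention

1. Why use an attention model?

The attention model helps with the poor quality of machine translation on long sentences. Self-attention additionally addresses the difficulty of parallelizing RNNs.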


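3. Types of attention

Dot-product attention has the advantage of being fast and space-efficient, since it can be implemented with matrix multiplication.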


III. self-attention

1. Computing self-attention (Attention Is All You Need)


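Each query q attends to every key k; that is, q^1 is scored against k^1, k^2, ... to obtain α_1,1, α_1,2, ...

Why divide by √d, where d is the dimension of q and k (the two have the same dimension)? The larger the dimension of q and k, the larger the dot product tends to be. Dividing by √d rescales the scores to a comparable range, which keeps the softmax from saturating.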
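The computation for a single query can be sketched as follows. This is a plain-Python illustration, assuming the scaling factor √d from the paper; the helper names and toy vectors are made up for the example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q, keys, values):
    """Scaled dot-product attention for a single query vector q."""
    d = len(q)
    # alpha_{1,i} = (q . k_i) / sqrt(d)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    weights = softmax(scores)
    # b = sum_i weights[i] * v_i
    out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
    return weights, out

# Hypothetical toy inputs with d = 4.
q1 = [1.0, 0.0, 1.0, 0.0]
keys = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
weights, b1 = attend(q1, keys, values)
print(weights, b1)
```

Here the first key matches the query, so it receives the larger weight and the output b1 leans toward the first value vector.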


2. How self-attention is parallelized

Self-attention ultimately reduces to a series of matrix multiplications, which can be computed in parallel.


All of the α values above can be computed in parallel.


3. Summary of the computation:
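In matrix form the whole computation can be written compactly. The following is a sketch in the notation of this section, assuming the inputs a^i are stacked as the rows of a matrix A and the softmax is applied row by row; the names A, W^Q, W^K, and W^V follow the paper's convention and are not spelled out in the original post.

```latex
\begin{aligned}
Q &= A W^{Q}, \qquad K = A W^{K}, \qquad V = A W^{V} \\
\hat{\alpha} &= \operatorname{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right) \\
B &= \hat{\alpha}\, V
\end{aligned}
```

Row i of B is the output b^i, so all outputs come from three matrix products and one row-wise softmax.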


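4. Types of self-attention

Multi-head: why? Because different heads can attend to different kinds of information. For example, the first head may focus on long-range information while the second focuses on short-range information.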


The two head outputs b^i,1 and b^i,2 are concatenated and multiplied by W^O to reduce the dimension, giving b^i.
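A small sketch of this step for one position, assuming two heads of dimension 2 and a hand-picked W^O (in a real model W^O is learned; the numbers here are invented for illustration):

```python
def vec_times_matrix(x, w):
    """Multiply row vector x (length n) by matrix w (n x m)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

# Hypothetical per-head outputs for position i (two heads, 2 dims each).
b_i_1 = [1.0, 2.0]   # output of head 1
b_i_2 = [3.0, 4.0]   # output of head 2

# Concatenate the heads into one vector of length 4.
concat = b_i_1 + b_i_2

# W^O projects the concatenation (4 dims) back down to the model size (2 dims).
W_O = [
    [0.5, 0.0],
    [0.0, 0.5],
    [0.5, 0.0],
    [0.0, 0.5],
]

b_i = vec_times_matrix(concat, W_O)
print(b_i)  # [2.0, 3.0]
```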


IV. seq2seq

Traditional seq2seq: an RNN is used in the middle.


　　seq2seq with attention


V. Transformer

A detailed walkthrough: https://mp.weixin.qq.com/s/RLxWevVWHXgX-UcoxDS70w

1. Overall architecture:


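The Transformer follows the encoder-decoder structure; both the encoder and the decoder use stacked self-attention and point-wise, fully connected layers.

Encoder: the encoder is a stack of 6 identical layers, each with two sub-layers.

The first sub-layer is a multi-head self-attention mechanism.

The second sub-layer is a simple position-wise fully connected feed-forward network.

Around each of the two sub-layers there is a residual connection, followed by layer normalization.

In other words, the output of every sub-layer is LayerNorm(x + sublayer(x)).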
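The expression LayerNorm(x + sublayer(x)) can be sketched for a single position as below. This is a simplified illustration: the layer norm has no learned gain or bias, and toy_sublayer is a hypothetical stand-in for the attention or feed-forward sub-layer.

```python
import math

def layer_norm(x, eps=1e-6):
    """Normalize one vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add_and_norm(x, sublayer):
    """Residual connection followed by layer normalization."""
    y = sublayer(x)                          # must have the same length as x
    summed = [a + b for a, b in zip(x, y)]   # x + sublayer(x)
    return layer_norm(summed)

# Hypothetical stand-in for an attention or feed-forward sub-layer.
def toy_sublayer(x):
    return [0.1 * v for v in x]

out = add_and_norm([1.0, 2.0, 3.0, 4.0], toy_sublayer)
print(out)
```

The element-wise sum only works when the sub-layer output has the same dimension as its input, which is why every sub-layer keeps the model dimension.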

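The input to the network is used as all three of q, k, and v; it is the sum of the word embedding and the position embedding. To make the residual connections straightforward, the output of every sub-layer must have the same dimension as its input.

Decoder:

  Three sub-layers: multi-head self-attention + multi-head attention + feed-forward

    The decoder also consists of N (N = 6) identical layers. A decoder layer is an encoder layer with one additional Multi-Head Attention + Add&Norm inserted.

    Input: the sum of the output embedding and the output position embedding is the input to the decoder.

    MA-1: the input passes through a Multi-Head Attention + Add&Norm sub-layer (MA-1). The output of MA-1 is the query (Q) input of the next Multi-Head Attention + Add&Norm sub-layer (MA-2).

    MA-2: the key and value inputs come from the encoder. In the original Transformer, every decoder layer takes them from the output of the final (6th) encoder layer, not from the encoder layer with the same index.

    The output of MA-2 goes into a position-wise feed-forward sub-layer (FF), again followed by Add&Norm (AN). After the last decoder layer, a linear transformation followed by softmax produces the probabilities of the target output.

    mask: the first multi-head attention sub-layer in the decoder needs masking, so that the prediction for position i depends only on the outputs at positions less than i.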

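2. Tricks and details

(1) Three uses

The Transformer uses multi-head attention in three different ways:
1. encoder-decoder attention: multi-head attention whose inputs are the encoder output and the output of the decoder's self-attention sub-layer. The encoder output supplies the keys and values, and the decoder's self-attention output supplies the queries.

2. encoder self-attention: multi-head attention in which Q, K, and V are all the same (in the first layer, the sum of the input embedding and the positional embedding).
3. decoder self-attention: in the decoder's self-attention layers, each position can attend to all positions up to and including itself.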

(2) Positional encoding


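Sine and cosine functions are used for the positional encoding because they are periodic: for any fixed offset k, the PE at position pos+k can be expressed as a linear function of the PE at position pos. This makes it easier for the model to learn the relative positions of words.

The self-attention described above has a problem: it carries no position information, because α is computed in the same way for nearby and distant inputs.

The positional encoding e^i is set by hand rather than learned, and it is added to a^i.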
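In the paper the fixed encoding is defined as PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). A minimal sketch of building such a table and adding it to the inputs, with a made-up toy embedding matrix a, might look like this:

```python
import math

def positional_encoding(max_len, d_model):
    """Return a max_len x d_model table of fixed sinusoidal position encodings."""
    table = []
    for pos in range(max_len):
        row = []
        for j in range(d_model):
            # Dimensions 2i and 2i+1 share the same frequency.
            angle = pos / (10000 ** ((2 * (j // 2)) / d_model))
            row.append(math.sin(angle) if j % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

pe = positional_encoding(max_len=4, d_model=4)

# Add e^i to a^i element-wise (a is a hypothetical toy embedding matrix).
a = [[0.1] * 4 for _ in range(4)]
a_with_pos = [[x + p for x, p in zip(a_row, pe_row)] for a_row, pe_row in zip(a, pe)]
print(pe[0])  # position 0: sin(0) = 0 and cos(0) = 1 alternate
```

Nothing in the table is trained; it depends only on the position and the dimension index.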


Why is it added to a^i rather than concatenated?


W^P here is not learned; it is computed separately by a fixed method.
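One common way to see why addition is enough is sketched below. It rests on an assumption that is not stated in the surviving text: the position is represented as a one-hot vector p^i appended to the raw input x^i, and the embedding matrix W is split into two blocks W^I and W^P (the symbols x^i, p^i, and W^I are introduced here for illustration).

```latex
W \begin{bmatrix} x^{i} \\ p^{i} \end{bmatrix}
= \begin{bmatrix} W^{I} & W^{P} \end{bmatrix}
  \begin{bmatrix} x^{i} \\ p^{i} \end{bmatrix}
= W^{I} x^{i} + W^{P} p^{i}
= a^{i} + e^{i}
```

Under that assumption, concatenating the position before the embedding and adding e^i after it give the same result, with e^i being the column of W^P selected by p^i.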


(3) Residual connections

Every sub-layer in every encoder layer has a residual connection, which in theory lets gradients propagate back well.

Author: 收到一只叮咚
Link: https://www.imooc.com/article/67493
Source: imooc (慕课网)

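(4) Layer Norm

Each sub-layer is also followed by a layer-normalization step (layer norm is commonly paired with RNNs), which speeds up model convergence.

The difference between Batch Norm and Layer Norm: Batch Norm normalizes each feature across the samples in a batch to mean 0 and sigma = 1. Layer Norm normalizes across the features of a single sample, so it does not depend on the batch size.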
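The difference in normalization axis can be illustrated with a tiny example. This is a simplified sketch with no learned scale or shift and no running statistics; the batch values are invented.

```python
import math

def normalize(values, eps=1e-6):
    """Shift and scale a list of numbers to mean 0 and (approximately) variance 1."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

# Hypothetical batch: 3 samples (rows) x 2 features (columns).
batch = [[1.0, 10.0],
         [2.0, 20.0],
         [3.0, 30.0]]

# Layer norm: normalize each sample across its own features (row-wise).
layer_normed = [normalize(row) for row in batch]

# Batch norm: normalize each feature across the batch (column-wise).
columns = [normalize(list(col)) for col in zip(*batch)]
batch_normed = [list(row) for row in zip(*columns)]

print(layer_normed[0])   # depends only on sample 0
print(batch_normed[0])   # depends on every sample in the batch
```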


(5) Position-wise feed-forward network


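It uses two Dense layers; the first has a ReLU activation and the second is linear.

It can also be viewed as two 1-D convolutions with kernel size 1. The hidden size changes as 512 -> 2048 -> 512.
The position-wise feed-forward network is really just an MLP. Each d_model-dimensional vector x in the output of the preceding sub-layer is first mapped by xW_1 + b_1 to a d_ff-dimensional x', and then max(0, x')W_2 + b_2 maps it back to d_model dimensions. A residual connection follows. The output size is still [sequence_length, d_model].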
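The formula max(0, xW_1 + b_1)W_2 + b_2 can be sketched in plain Python as follows. Toy sizes 2 -> 3 -> 2 stand in for 512 -> 2048 -> 512, and the weights are invented for illustration.

```python
def linear(x, w, b):
    """Compute x @ w + b for a row vector x and a matrix w."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) + b[j] for j in range(len(b))]

def feed_forward(x, w1, b1, w2, b2):
    """FFN(x) = max(0, x W_1 + b_1) W_2 + b_2, applied to one position."""
    hidden = [max(0.0, h) for h in linear(x, w1, b1)]  # ReLU on the first layer only
    return linear(hidden, w2, b2)                      # second layer is linear

# Toy weights: d_model = 2, d_ff = 3.
w1 = [[1.0, -1.0, 0.5],
      [0.0,  1.0, 0.5]]
b1 = [0.0, 0.0, 0.0]
w2 = [[1.0, 0.0],
      [0.0, 1.0],
      [1.0, 1.0]]
b2 = [0.0, 0.0]

# The same weights are applied to every position independently.
sequence = [[1.0, 2.0], [3.0, 1.0]]
outputs = [feed_forward(x, w1, b1, w2, b2) for x in sequence]
print(outputs)  # [[2.5, 2.5], [5.0, 2.0]]
```

Applying the same two layers to each position separately is what makes the network "position-wise", and it is also why it matches a convolution with kernel size 1.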

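(6) Masked: [decoder]

Note that the encoder uses self-attention, whereas the decoder uses masked self-attention.

Masked here means that in language modelling (or tasks such as translation), the model is not allowed to see future information.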


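The mask covers the region above the diagonal (the future positions) so that its attention weights are 0 and the model cannot see future information. In practice this is done by setting those scores to negative infinity before the softmax.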
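A minimal sketch of causal masking, assuming the scores are masked with negative infinity before the softmax so that the resulting weights for future positions are exactly 0 (the score values are arbitrary toy numbers):

```python
import math

def softmax(scores):
    """Numerically stable softmax; entries equal to -inf get weight 0."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def masked_attention_weights(scores):
    """Apply a causal mask to an n x n score matrix, then softmax each row."""
    n = len(scores)
    weights = []
    for i in range(n):
        # Position i may attend to positions 0..i; later ones get -inf.
        row = [scores[i][j] if j <= i else float("-inf") for j in range(n)]
        weights.append(softmax(row))
    return weights

# Hypothetical raw scores (all equal) for a sequence of length 3.
scores = [[1.0, 1.0, 1.0] for _ in range(3)]
causal_weights = masked_attention_weights(scores)
for row in causal_weights:
    print(row)
```

Each row still sums to 1, but all the weight is spread over the current and earlier positions.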

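(7) Optimization

The model is trained with Adam, and the paper proposes a learning-rate schedule called warm-up: the learning rate increases linearly over the first warmup_steps training steps and then decays in proportion to the inverse square root of the step number.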
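The schedule given in the paper is lrate = d_model^(-0.5) * min(step^(-0.5), step * warmup_steps^(-1.5)). A small sketch, using the paper's values d_model = 512 and warmup_steps = 4000 as defaults (the function name is made up for this example):

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Learning rate at a given training step (step >= 1)."""
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

for step in (1, 1000, 4000, 16000):
    print(step, transformer_lr(step))
```

The two arguments of min are equal at step = warmup_steps, which is where the rate peaks before it starts to decay.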


Further development: Universal Transformer

Applications: NLP; Self-Attention GAN (applied to images)

3. Hands-on practice

https://www.jianshu.com/p/2b0a5541a17c

3.1 encoder

  (1) Input: add the encoder embedding and the position embedding

  (2) Two kinds of attention

  (3) Add & Normalize & FFN

3.2 decoder

  (1) Input: add the decoder embedding and the position embedding

  (2) Masked multi-head attention and encoder-decoder attention

  (3) Add & Normalize & FFN & output

3.1 encoder

(1) Input: add the input embedding and the position embedding

  Raw data: word2vec [embedding table] + input_sentence [x] + output_sentence [y] + position embedding (fixed)

  ① Feed in input_sentence [x] and word2vec [embedding table]

Suppose we have two training examples (input_sentence [x] -> output_sentence [y]):

[机, 器, 学, 习] -> [machine, learning]
[学, 习, 机, 器] -> [learning, machine]

After conversion to ids, the encoder input becomes [[0,1,2,3],[2,3,0,1]].
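The id conversion can be sketched with a simple lookup table. The vocabulary mapping below is an assumption chosen to reproduce the ids quoted in the text.

```python
# Hypothetical character vocabulary matching the ids used in the text.
vocab = {"机": 0, "器": 1, "学": 2, "习": 3}

sentences = [["机", "器", "学", "习"],
             ["学", "习", "机", "器"]]

# Replace every character with its id.
encoder_input = [[vocab[ch] for ch in sent] for sent in sentences]
print(encoder_input)  # [[0, 1, 2, 3], [2, 3, 0, 1]]
```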

Next, the ids are looked up in the Chinese embedding table (word2vec) to obtain the embeddings.


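  ② The position embedding is set to fixed values here for convenience; in practice it is computed with trigonometric functions. Note that this position embedding is not updated during training.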


  ③ Add the positional offset position_embedding to input_embedding. Note that this is an element-wise addition of the two.


  ④ output_sentence [y] is processed in the same way as input_sentence.

Code:

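import tensorflow.compat.v1 as tf

# This example uses the TF 1.x graph/session API; on TF 2.x the call below re-enables it.
tf.disable_v2_behavior()

# Toy embedding table for the 4 Chinese tokens (vocabulary size 4, embedding size 4).
chinese_embedding = tf.constant([[0.11, 0.21, 0.31, 0.41],
                                 [0.21, 0.31, 0.41, 0.51],
                                 [0.31, 0.41, 0.51, 0.61],
                                 [0.41, 0.51, 0.61, 0.71]], dtype=tf.float32)

# Toy embedding table for the English tokens (not used in this encoder snippet).
english_embedding = tf.constant([[0.51, 0.61, 0.71, 0.81],
                                 [0.52, 0.62, 0.72, 0.82],
                                 [0.53, 0.63, 0.73, 0.83],
                                 [0.54, 0.64, 0.74, 0.84]], dtype=tf.float32)

# Fixed position encoding, one row per position; set by hand instead of using sin/cos.
position_encoding = tf.constant([[0.01, 0.01, 0.01, 0.01],
                                 [0.02, 0.02, 0.02, 0.02],
                                 [0.03, 0.03, 0.03, 0.03],
                                 [0.04, 0.04, 0.04, 0.04]], dtype=tf.float32)

# Token ids for the two training sentences, shape [batch=2, seq_len=4].
encoder_input = tf.constant([[0, 1, 2, 3], [2, 3, 0, 1]], dtype=tf.int32)

with tf.variable_scope("encoder_input"):
    # Look up each id in the embedding table; the result has shape [2, 4, 4].
    encoder_embedding_input = tf.nn.embedding_lookup(chinese_embedding, encoder_input)
    # position_encoding ([4, 4]) broadcasts over the batch dimension.
    encoder_embedding_input = encoder_embedding_input + position_encoding

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run([encoder_embedding_input]))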


(1) RNN (N vs N)	(2) RNN (N vs 1)

(3) RNN (1 vs N)	(4) RNN (N vs M): seq2seq
