Study Notes: The Transformer Self-Attention Mechanism

Transformer

Link to the lecture by Prof. Hung-yi Lee (National Taiwan University)

Self-Attention

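A traditional RNN is hard to parallelize: for example, b4 can only be computed after a1, a2, a3, and a4 have all been processed in order.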

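A CNN can be parallelized: each yellow triangle in the figure represents one filter, and the filters can run in parallel.
- Filters in higher layers can cover a longer span of the sequence; the blue triangle, for example, takes the outputs of the first layer as its input.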

Self-Attention can replace a bidirectional RNN
It can be computed in parallel
Every output has access to information from the whole sentence

Attention: takes two vectors and outputs a score that indicates how well they match

Scaled Dot-Product Attention
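
As a minimal sketch of the score described above, assuming q and k are plain NumPy vectors of the same dimension d (the function name `attention_score` and the example values are made up for illustration):

```python
import numpy as np

def attention_score(q, k):
    """Scaled dot-product score between one query and one key."""
    d = q.shape[0]                      # dimension shared by q and k
    return float(q @ k) / np.sqrt(d)    # dot product, scaled by sqrt(d)

# Illustrative vectors, not taken from the notes.
q = np.array([1.0, 0.0, 1.0, 0.0])
k = np.array([1.0, 1.0, 1.0, 1.0])
score = attention_score(q, k)           # dot product 2.0 divided by sqrt(4)
```

A larger score means the query and the key match better.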

Why divide by √d?

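The larger the dimension d of q and k, the larger the variance of their dot product, so the score is divided by √d to compensate. If the components of q and k are independent with zero mean and unit variance, the dot product has variance d and therefore standard deviation √d; this is the scaling the paper's authors chose.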
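
The variance argument can be checked empirically. The sketch below assumes the components of q and k are independent standard normal samples, which is an assumption for illustration and not something the notes state; the helper name `dot_product_variance` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def dot_product_variance(d, n_samples=5000):
    """Empirical variance of q.k before and after dividing by sqrt(d)."""
    q = rng.standard_normal((n_samples, d))   # assumed: unit-variance components
    k = rng.standard_normal((n_samples, d))
    dots = np.sum(q * k, axis=1)              # one dot product per sample
    return dots.var(), (dots / np.sqrt(d)).var()

for d in (4, 64, 512):
    raw, scaled = dot_product_variance(d)
    print(f"d={d}: raw variance {raw:.1f}, scaled variance {scaled:.2f}")
```

Under this assumption the raw variance should grow roughly in proportion to d, while the scaled variance should stay near 1 for every d.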

The scores a obtained above are then passed through a softmax to get a-hat

The output b1 already takes every position in the sentence into account. To ignore a particular position, just make its a-hat equal to 0
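
A minimal sketch of this step, assuming b1 is the a-hat-weighted sum of the value vectors v1..v4. The scores and values are made-up numbers, and setting a score to minus infinity before the softmax is one common way (an assumption here, not something the notes prescribe) to make the corresponding a-hat exactly 0:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - np.max(a))        # subtract the max for numerical stability
    return e / e.sum()

scores = np.array([1.0, 0.5, 2.0, 0.0])     # scores of q1 against k1..k4 (made up)
V = np.array([[1.0, 0.0, 2.0, 4.0],
              [0.0, 1.0, 2.0, 4.0]])        # columns are v1..v4 (made up)

a_hat = softmax(scores)
b1 = V @ a_hat                               # b1 = sum_i a_hat[i] * v_i

# Ignore position 4: a score of -inf becomes exactly 0 after the softmax.
masked_scores = scores.copy()
masked_scores[3] = -np.inf
a_hat_masked = softmax(masked_scores)
b1_masked = V @ a_hat_masked                 # v4 no longer contributes
```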

b2, b3, … can be computed at the same time

Computing Q, K, and V in parallel

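Placing a1, a2, a3, a4 side by side as columns gives a matrix; multiplying it by Wq gives Q
K and V are obtained the same way, using Wk and Wv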
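
A minimal sketch of this matrix form, assuming the inputs are stacked as columns (the name `I` for the stacked input matrix, the sizes, and the random weights are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k, n = 8, 4, 4                  # illustrative sizes

I = rng.standard_normal((d_model, n))      # columns are a1..a4
Wq = rng.standard_normal((d_k, d_model))   # random stand-ins for learned weights
Wk = rng.standard_normal((d_k, d_model))
Wv = rng.standard_normal((d_k, d_model))

Q = Wq @ I     # column i is q_i = Wq a_i
K = Wk @ I     # column i is k_i = Wk a_i
V = Wv @ I     # column i is v_i = Wv a_i
```

One matrix product replaces four separate matrix-vector products, which is what makes the computation easy to parallelize.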

The attention scores a

A is obtained with a single matrix operation
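
Assuming the columns of Q and K are the per-position queries and keys, all scores can be sketched as one product. The layout A[i, j] = k_i · q_j is an assumption chosen so that each column holds the scores for one query; the random Q and K are stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k, n = 4, 4                            # illustrative sizes
Q = rng.standard_normal((d_k, n))        # columns are q1..q4
K = rng.standard_normal((d_k, n))        # columns are k1..k4

# A[i, j] is the scaled score of key i against query j.
A = (K.T @ Q) / np.sqrt(d_k)
```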

Apply softmax to each column

O is obtained with a matrix operation
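
A sketch of these two steps, assuming column j of the score matrix holds the scores for query j and the columns of V are the value vectors (the helper name `softmax_columns` and the random inputs are illustrative):

```python
import numpy as np

def softmax_columns(A):
    """Softmax applied independently to every column of A."""
    e = np.exp(A - A.max(axis=0, keepdims=True))   # stabilize each column
    return e / e.sum(axis=0, keepdims=True)

rng = np.random.default_rng(0)
d_v, n = 4, 4                            # illustrative sizes
A = rng.standard_normal((n, n))          # stand-in score matrix
V = rng.standard_normal((d_v, n))        # columns are v1..v4

A_hat = softmax_columns(A)               # every column sums to 1
O = V @ A_hat                            # column j is b_j = sum_i A_hat[i, j] * v_i
```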

Summary of the parallel matrix operations
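
Putting the steps together, the sketch below compares the matrix form against a per-position loop to check that they agree. It assumes column-stacked inputs and random stand-in weights; the function names are hypothetical.

```python
import numpy as np

def softmax_columns(A):
    e = np.exp(A - A.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def self_attention(I, Wq, Wk, Wv):
    """Matrix form: all positions are handled by a few matrix products."""
    Q, K, V = Wq @ I, Wk @ I, Wv @ I
    A = (K.T @ Q) / np.sqrt(Q.shape[0])
    return V @ softmax_columns(A)

def self_attention_loop(I, Wq, Wk, Wv):
    """Same computation, one output position at a time."""
    n = I.shape[1]
    qs = [Wq @ I[:, i] for i in range(n)]
    ks = [Wk @ I[:, i] for i in range(n)]
    vs = [Wv @ I[:, i] for i in range(n)]
    d = qs[0].shape[0]
    outputs = []
    for j in range(n):
        scores = np.array([ks[i] @ qs[j] for i in range(n)]) / np.sqrt(d)
        w = np.exp(scores - scores.max())
        w = w / w.sum()                              # a-hat for query j
        outputs.append(sum(w[i] * vs[i] for i in range(n)))
    return np.stack(outputs, axis=1)

rng = np.random.default_rng(0)
I = rng.standard_normal((8, 4))                      # columns are a1..a4
Wq, Wk, Wv = (rng.standard_normal((4, 8)) for _ in range(3))
O_matrix = self_attention(I, Wq, Wk, Wv)
O_loop = self_attention_loop(I, Wq, Wk, Wv)
```

The loop version makes the per-position logic explicit, while the matrix version is the one that maps well onto parallel hardware.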

Multi-head Self-Attention

Some heads attend to local information and others to global information; each head has its own job
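
A minimal multi-head sketch, assuming each head has its own Wq, Wk, Wv and that the head outputs are concatenated and mixed by an output matrix (called `Wo` here; the name, the sizes, and the random weights are assumptions for illustration):

```python
import numpy as np

def softmax_columns(A):
    e = np.exp(A - A.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def self_attention(I, Wq, Wk, Wv):
    Q, K, V = Wq @ I, Wk @ I, Wv @ I
    A = (K.T @ Q) / np.sqrt(Q.shape[0])
    return V @ softmax_columns(A)

def multi_head_self_attention(I, heads, Wo):
    """Run every head independently, concatenate along features, then mix."""
    per_head = [self_attention(I, Wq, Wk, Wv) for Wq, Wk, Wv in heads]
    return Wo @ np.concatenate(per_head, axis=0)

rng = np.random.default_rng(0)
d_model, n, n_heads = 8, 4, 2                        # illustrative sizes
d_head = d_model // n_heads
I = rng.standard_normal((d_model, n))                # columns are a1..a4
heads = [tuple(rng.standard_normal((d_head, d_model)) for _ in range(3))
         for _ in range(n_heads)]
Wo = rng.standard_normal((d_model, n_heads * d_head))
B = multi_head_self_attention(I, heads, Wo)          # columns are b1..b4
```

Because each head has separate weights, nothing forces the heads to learn the same attention pattern.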

Taking the position of each input into account

Every position i has its own distinct vector ei

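Question: why add ei rather than concatenate it? (The two amount to the same thing: appending a one-hot position vector to the input and then applying one linear transform gives the input's embedding plus a position-dependent vector.)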
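
The equivalence can be checked numerically. The sketch assumes the position is encoded as a one-hot vector p appended to the input x, and that the combined weight matrix is split as [W_I W_P]; these names and sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_model, max_len = 5, 8, 6                 # illustrative sizes

W_I = rng.standard_normal((d_model, d_in))       # part of W acting on the input
W_P = rng.standard_normal((d_model, max_len))    # part of W acting on the position

x = rng.standard_normal(d_in)                    # one input vector
pos = 2
p = np.zeros(max_len)
p[pos] = 1.0                                     # one-hot position vector

# Concatenate first, then apply the combined matrix [W_I W_P].
concat_then_project = np.hstack([W_I, W_P]) @ np.concatenate([x, p])

# Project separately, then add.
a = W_I @ x
e = W_P @ p                                      # equals column `pos` of W_P
project_then_add = a + e
```

Both routes give the same vector, and e is simply the column of W_P selected by the position.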

What WP looks like
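
The notes do not say how WP is filled in. One well-known choice is the fixed sinusoidal encoding from the original Transformer paper; the sketch below builds such a matrix with one column per position, assuming an even d_model. The function name `sinusoidal_wp` is hypothetical.

```python
import numpy as np

def sinusoidal_wp(max_len, d_model):
    """Matrix whose column `pos` is the sinusoidal encoding of that position."""
    pos = np.arange(max_len)[None, :]              # shape (1, max_len)
    two_i = np.arange(0, d_model, 2)[:, None]      # even feature indices 0, 2, 4, ...
    angles = pos / np.power(10000.0, two_i / d_model)
    WP = np.zeros((d_model, max_len))
    WP[0::2, :] = np.sin(angles)                   # even rows use sine
    WP[1::2, :] = np.cos(angles)                   # odd rows use cosine
    return WP

WP = sinusoidal_wp(max_len=10, d_model=8)
e_3 = WP[:, 3]                                     # position vector for position 3
```

Plotting such a matrix as an image shows bands whose frequency changes with the row index.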

Self-Attention can now be used in place of an RNN

Encoder & Decoder

Detailed walkthrough

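Encoder:

The Input passes through an Input Embedding and becomes a vector
The Positional Encoding is added to the vector, which then enters the gray block (repeated N times)
The Multi-Head Attention layer produces b1, b2, b3, b4
Add & Norm: the input a is added to the output b to give b', which then goes through Layer Normalization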

Layer Normalization: for each example, normalize across its dimensions so that μ = 0 and σ = 1
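
A minimal Add & Norm sketch, assuming each column is one position and the normalization runs over the feature dimension of each column. The learnable gain and bias that practical layer normalization usually carries are left out here for brevity.

```python
import numpy as np

def layer_norm(X, eps=1e-5):
    """Normalize every column (one position) to zero mean and unit variance."""
    mu = X.mean(axis=0, keepdims=True)
    var = X.var(axis=0, keepdims=True)
    return (X - mu) / np.sqrt(var + eps)   # eps avoids division by zero

def add_and_norm(a, b):
    """Residual connection followed by layer normalization."""
    return layer_norm(a + b)               # b' = a + b, then normalize

rng = np.random.default_rng(0)
a = rng.standard_normal((8, 4))            # inputs to the attention layer
b = rng.standard_normal((8, 4))            # stand-in for the attention outputs
out = add_and_norm(a, b)
```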

Decoder:

The input is the Outputs produced at the previous time step; initially it is
Masked Multi-Head Attention: when doing Self-Attention, the Decoder attends only to the Sequence generated so far
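
One way to sketch the masking, assuming a score matrix laid out as A[i, j] = score of key i against query j: every entry with i > j (a position not generated yet) is set to minus infinity before the column-wise softmax. The function name and the random scores are illustrative.

```python
import numpy as np

def masked_softmax_columns(A):
    """Column-wise softmax in which query j only sees keys i <= j."""
    n = A.shape[0]
    future = np.tril(np.ones((n, n), dtype=bool), k=-1)   # True where i > j
    A = np.where(future, -np.inf, A)                       # hide future positions
    e = np.exp(A - A.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))          # stand-in decoder self-attention scores
A_hat = masked_softmax_columns(A)        # zero weight on every future position
```

The first query can only attend to itself, so its column has all of its weight on the first position.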

The parts that repeat the Encoder are not described again

Multi-Head Attention results
