A collection of formulas for common sequence models, kept for later interview review.
RNN
Formula: $h_{t}=f\left(W \cdot\left[h_{t-1}, x_{t}\right]+b\right)$
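The recurrence above can be sketched in NumPy as follows. This is only a minimal illustration: the function name `rnn_step`, the choice of `tanh` for $f$, and the toy dimensions are assumptions, not part of the original notes.

```python
import numpy as np

def rnn_step(h_prev, x_t, W, b, f=np.tanh):
    """One RNN step: h_t = f(W · [h_{t-1}, x_t] + b).

    f defaults to tanh here, which is an assumption; the formula leaves f generic.
    """
    concat = np.concatenate([h_prev, x_t])  # [h_{t-1}, x_t]
    return f(W @ concat + b)

# Toy sizes chosen for illustration only.
hidden_size, input_size = 4, 3
rng = np.random.default_rng(0)
W = rng.normal(size=(hidden_size, hidden_size + input_size))
b = np.zeros(hidden_size)

h = np.zeros(hidden_size)
for x_t in rng.normal(size=(5, input_size)):  # a sequence of 5 inputs
    h = rnn_step(h, x_t, W, b)
```

The same `W` and `b` are reused at every time step, which is what makes the network recurrent.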
LSTM
Formulas:
Forget gate: $f_{t}=\sigma\left(W_{f} \cdot\left[h_{t-1}, x_{t}\right]+b_{f}\right)$
Input gate: $i_{t}=\sigma\left(W_{i} \cdot\left[h_{t-1}, x_{t}\right]+b_{i}\right)$
Candidate cell state: $\tilde{C}_{t}=\tanh \left(W_{C} \cdot\left[h_{t-1}, x_{t}\right]+b_{C}\right)$
Cell state update: $C_{t}=f_{t} * C_{t-1}+i_{t} * \tilde{C}_{t}$
Output gate: $o_{t}=\sigma\left(W_{o} \cdot\left[h_{t-1}, x_{t}\right]+b_{o}\right)$
Output (hidden state): $h_{t}=o_{t} * \tanh \left(C_{t}\right)$
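A single LSTM step following the six formulas above might look like this in NumPy. The function name `lstm_step` and the `params` dictionary layout are hypothetical, chosen only for this sketch; `*` in the formulas is taken to be element-wise multiplication.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(h_prev, c_prev, x_t, params):
    """One LSTM step. `params` is a hypothetical dict holding W_f, b_f, W_i, b_i,
    W_C, b_C, W_o, b_o, each W of shape (hidden, hidden + input)."""
    concat = np.concatenate([h_prev, x_t])                      # [h_{t-1}, x_t]
    f_t = sigmoid(params["W_f"] @ concat + params["b_f"])       # forget gate
    i_t = sigmoid(params["W_i"] @ concat + params["b_i"])       # input gate
    c_tilde = np.tanh(params["W_C"] @ concat + params["b_C"])   # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde                          # cell state update
    o_t = sigmoid(params["W_o"] @ concat + params["b_o"])       # output gate
    h_t = o_t * np.tanh(c_t)                                    # hidden state
    return h_t, c_t

# Toy setup for illustration only.
hidden_size, input_size = 4, 3
rng = np.random.default_rng(0)
params = {}
for name in ("f", "i", "C", "o"):
    params[f"W_{name}"] = rng.normal(size=(hidden_size, hidden_size + input_size))
    params[f"b_{name}"] = np.zeros(hidden_size)

h_t, c_t = lstm_step(np.zeros(hidden_size), np.zeros(hidden_size),
                     rng.normal(size=input_size), params)
```

Note that the cell state `c_t` is updated additively, while the hidden state `h_t` is a gated, squashed view of it.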
GRU
Formulas:
Update gate: $z_{t}=\sigma\left(W_{z} \cdot\left[h_{t-1}, x_{t}\right]\right)$
Reset gate: $r_{t}=\sigma\left(W_{r} \cdot\left[h_{t-1}, x_{t}\right]\right)$
Candidate state: $\tilde{h}_{t}=\tanh \left(W \cdot\left[r_{t} * h_{t-1}, x_{t}\right]\right)$
Update: $h_{t}=\left(1-z_{t}\right) * h_{t-1}+z_{t} * \tilde{h}_{t}$
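The GRU formulas above translate almost line for line into NumPy. As in the formulas, this sketch omits bias terms; the function name `gru_step` and the toy sizes are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(h_prev, x_t, W_z, W_r, W):
    """One GRU step without biases, mirroring the formulas above."""
    concat = np.concatenate([h_prev, x_t])                       # [h_{t-1}, x_t]
    z_t = sigmoid(W_z @ concat)                                  # update gate
    r_t = sigmoid(W_r @ concat)                                  # reset gate
    h_tilde = np.tanh(W @ np.concatenate([r_t * h_prev, x_t]))   # candidate state
    return (1.0 - z_t) * h_prev + z_t * h_tilde                  # interpolate old and new

# Toy setup for illustration only.
hidden_size, input_size = 4, 3
rng = np.random.default_rng(0)
shape = (hidden_size, hidden_size + input_size)
W_z, W_r, W = (rng.normal(size=shape) for _ in range(3))

h_t = gru_step(np.zeros(hidden_size), rng.normal(size=input_size), W_z, W_r, W)
```

Unlike the LSTM, there is no separate cell state: the update gate `z_t` directly interpolates between the previous state and the candidate.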
Attention mechanism
There are many ways to compute attention; the formulas below are just one commonly used variant, and the computation resembles the q/k/v scheme in the Transformer. The formulas take the decoder's first state $\mathrm{s}_{0}$ as the example. The encoder input has length $m$, with encoder hidden states $\mathbf{h}_{1}, \cdots, \mathbf{h}_{m}$, and the matrices $\mathbf{W}$ are parameters learned during training.
Formulas:
Compute $\mathrm{q}$: $\mathrm{q}_{0}=\mathbf{W}_{Q} \cdot \mathrm{s}_{0}$
Compute $\mathrm{k}$: $\mathrm{k}_{i}=\mathbf{W}_{K} \cdot \mathbf{h}_{i}$, for $i=1$ to $m$
Compute the score for each position: $\tilde{\alpha}_{i}=\mathrm{k}_{i}^{T} \mathrm{q}_{0}$, for $i=1$ to $m$
Softmax normalization: $\left[\alpha_{1}, \cdots, \alpha_{m}\right]=\operatorname{Softmax}\left(\left[\tilde{\alpha}_{1}, \cdots, \tilde{\alpha}_{m}\right]\right)$ (the softmax formula should be familiar by now)
Finally, compute the current context vector: $c_{0}=\alpha_{1} \mathbf{h}_{1}+\cdots+\alpha_{m} \mathbf{h}_{m}$
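The five steps above can be sketched in NumPy as below. The function name `attention_context`, the stacking of the encoder states into a matrix `H` with one row per position, and the toy dimensions are assumptions made for this illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return e / e.sum()

def attention_context(s0, H, W_Q, W_K):
    """Context vector for decoder state s0.

    H has shape (m, d_h): row i is the encoder state h_i (an assumed layout).
    """
    q0 = W_Q @ s0          # q_0 = W_Q · s_0
    K = H @ W_K.T          # row i is k_i = W_K · h_i
    scores = K @ q0        # score_i = k_i^T q_0
    alpha = softmax(scores)
    c0 = alpha @ H         # c_0 = sum_i alpha_i h_i
    return c0, alpha

# Toy setup for illustration only.
m, d_h, d_s, d_attn = 6, 5, 4, 3
rng = np.random.default_rng(0)
H = rng.normal(size=(m, d_h))
s0 = rng.normal(size=d_s)
W_Q = rng.normal(size=(d_attn, d_s))
W_K = rng.normal(size=(d_attn, d_h))

c0, alpha = attention_context(s0, H, W_Q, W_K)
```

The weights `alpha` sum to 1, so `c0` is a convex combination of the encoder states.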
Transformer
The Transformer's formulas are hard to write out in full, so only a few key ones are given here.
Formulas:
Compute $Q$, $K$, $V$ (with each row of $X$ a token embedding): $Q=X W^{Q}$, $K=X W^{K}$, $V=X W^{V}$
Self-attention: $\operatorname{Attention}(Q, K, V)=\operatorname{softmax}\left(\frac{Q K^{T}}{\sqrt{d_{k}}}\right) V$
Feed-forward layer: $\operatorname{FFN}(Z)=\max \left(0, Z W_{1}+b_{1}\right) W_{2}+b_{2}$
Positional encoding: $PE(pos, 2i)=\sin \left(\frac{pos}{10000^{2i / d_{model}}}\right)$, $PE(pos, 2i+1)=\cos \left(\frac{pos}{10000^{2i / d_{model}}}\right)$
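As an illustration of the positional encoding formula, the sketch below fills a `(max_len, d_model)` table with sines in the even columns and cosines in the odd columns. The function name `positional_encoding` is hypothetical, and the sketch assumes `d_model` is even.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encoding table; assumes d_model is even."""
    pos = np.arange(max_len)[:, None]                   # (max_len, 1)
    two_i = np.arange(0, d_model, 2)[None, :]           # the even indices 2i
    angles = pos / np.power(10000.0, two_i / d_model)   # pos / 10000^(2i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # PE(pos, 2i)
    pe[:, 1::2] = np.cos(angles)  # PE(pos, 2i + 1)
    return pe

pe = positional_encoding(max_len=50, d_model=16)
```

Each pair of columns shares one frequency, and the frequencies decrease as the column index grows.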
The formulas above cover only part of the model; other details include multi-head attention, residual connections and layer norm, and the mask in the decoder.
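A minimal single-head sketch of the self-attention and feed-forward formulas follows. It assumes `X` holds one token embedding per row, so the projections are written as `X @ W_Q`; it leaves out the multi-head split, residual connections, layer norm, and masking. All function names and sizes are illustrative.

```python
import numpy as np

def softmax_rows(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))  # stabilize each row
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    """Single-head scaled dot-product self-attention; rows of X are tokens."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = K.shape[-1]
    weights = softmax_rows(Q @ K.T / np.sqrt(d_k))  # (n_tokens, n_tokens)
    return weights @ V

def ffn(Z, W1, b1, W2, b2):
    """Position-wise feed-forward layer: max(0, Z W1 + b1) W2 + b2."""
    return np.maximum(0.0, Z @ W1 + b1) @ W2 + b2

# Toy setup for illustration only.
n_tokens, d_model, d_k, d_ff = 5, 8, 8, 32
rng = np.random.default_rng(0)
X = rng.normal(size=(n_tokens, d_model))
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))
W1, b1 = rng.normal(size=(d_k, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

Z = self_attention(X, W_Q, W_K, W_V)
out = ffn(Z, W1, b1, W2, b2)
```

Every row of `weights` sums to 1, so each output row of `self_attention` is a weighted average of the rows of `V`.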
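The core component of the Transformer is attention. Implementations use the matrix multiplications above; to make it easier to follow, here is the attention computation for a single word:
Derive the three vectors $q$, $k$, $v$ from the embedding;
Use the current word's $q$ to compute a score against every word in the sequence: $\text{score}=q \cdot k$;
To keep the score distribution from becoming too peaked, scale the scores by dividing by $\sqrt{d_{k}}$;
Normalize the scores with softmax;
Take the weighted sum to get the current word's context vector $z$: $z=\sum_{i} \alpha_{i} v_{i}$
On closer inspection, this is almost identical to the attention computation in the previous section.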
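The per-word steps can be sketched as below. The function name `attend_one_word` is hypothetical, and the sketch assumes the $k$ and $v$ vectors of all words are already stacked as rows of `K` and `V`.

```python
import numpy as np

def attend_one_word(q, K, V):
    """Context vector z for one word, given its query q.

    K and V have one row per word in the sequence (an assumed layout).
    """
    d_k = K.shape[-1]
    scores = K @ q                    # score_i = q · k_i for every word i
    scores = scores / np.sqrt(d_k)    # scale to avoid an overly peaked softmax
    alpha = np.exp(scores - scores.max())
    alpha = alpha / alpha.sum()       # softmax normalization
    return alpha @ V                  # z = sum_i alpha_i v_i

# Toy setup for illustration only.
n_words, d_k = 5, 8
rng = np.random.default_rng(0)
Q_all = rng.normal(size=(n_words, d_k))
K = rng.normal(size=(n_words, d_k))
V = rng.normal(size=(n_words, d_k))

z = attend_one_word(Q_all[2], K, V)  # context vector for the third word
```

Running this for every row of `Q_all` and stacking the results gives the same output as the matrix form $\operatorname{softmax}(QK^{T}/\sqrt{d_{k}})V$.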
More to come...