Notation, where $X_{i,j}$ is the number of times word $j$ occurs in the context of word $i$ and $N$ is the vocabulary size:

$$
X_i = \sum_{j=1}^{N} X_{i,j} \\
P_{i,k} = \frac{X_{i,k}}{X_i} \\
ratio_{i,j,k} = \frac{P_{i,k}}{P_{j,k}}
$$

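| Value of $ratio_{i,j,k}$ | Words $j,k$ related | Words $j,k$ unrelated |
| --- | --- | --- |
| Words $i,k$ related | approaches 1 | very large |
| Words $i,k$ unrelated | very small | approaches 1 |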
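To make the table concrete, here is a minimal sketch that computes $X_i$, $P_{i,k}$ and $ratio_{i,j,k}$ from a small co-occurrence matrix. The five word labels and all of the counts are invented for illustration; they are assumptions, not data from a real corpus.

```python
# Hypothetical vocabulary; the labels only make the example easier to read.
WORDS = ["ice", "steam", "solid", "gas", "water"]

# Hypothetical symmetric co-occurrence counts X[i][j] (invented numbers).
X = [
    [1.0, 2.0, 8.0, 1.0, 6.0],  # ice
    [2.0, 1.0, 1.0, 8.0, 6.0],  # steam
    [8.0, 1.0, 1.0, 2.0, 3.0],  # solid
    [1.0, 8.0, 2.0, 1.0, 3.0],  # gas
    [6.0, 6.0, 3.0, 3.0, 1.0],  # water
]


def row_total(X, i):
    """X_i: sum of row i of the co-occurrence matrix."""
    return sum(X[i])


def cond_prob(X, i, k):
    """P_{i,k} = X_{i,k} / X_i."""
    return X[i][k] / row_total(X, i)


def ratio(X, i, j, k):
    """ratio_{i,j,k} = P_{i,k} / P_{j,k}."""
    return cond_prob(X, i, k) / cond_prob(X, j, k)


if __name__ == "__main__":
    ice, steam = 0, 1
    for k, word in enumerate(WORDS):
        print(word, ratio(X, ice, steam, k))
```

With these made-up counts, the ratio for the probe word "solid" is large (about 8), for "gas" it is small (about 0.125), and for "water", which co-occurs equally with both words, it is 1.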

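Derivation:
Suppose the word vectors have already been obtained. They should then be consistent with the co-occurrence matrix. Let $g(w_i, w_j, w_k)$ be the function that computes $ratio_{i,j,k}$ from the word vectors $w_i, w_j, w_k$. Then: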

$$
\frac{P_{i,k}}{P_{j,k}} = ratio_{i,j,k} = g(w_i, w_j, w_k)
$$
The two sides of this equation should be as close as possible, which gives the cost function:
$$
J = \sum_{i,j,k}^{N} \left( \frac{P_{i,k}}{P_{j,k}} - g(w_i, w_j, w_k) \right)^2
$$
However, this model involves three words at once, so its complexity is $N \times N \times N$.
How can it be simplified?

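  1. To capture the relationship between words $i$ and $j$, $g$ should presumably contain $w_i - w_j$;
  2. $ratio_{i,j,k}$ is a scalar, so $g$ must be a scalar too, which suggests that $g$ contains $(w_i - w_j)^T w_k$;
  3. Finally, wrap this in the exponential $\exp()$, giving $g(w_i, w_j, w_k) = \exp((w_i - w_j)^T w_k)$.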
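The three steps above can be sketched in a few lines. This is only an illustration: the 2-dimensional vectors are arbitrary made-up values, and `dot` and `g` are helper names chosen here. The sketch checks numerically that the exponential of the difference factors into a quotient of two exponentials, which is the property that makes this choice of $g$ convenient.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def g(w_i, w_j, w_k):
    """g(w_i, w_j, w_k) = exp((w_i - w_j)^T w_k), a scalar."""
    diff = [a - b for a, b in zip(w_i, w_j)]
    return math.exp(dot(diff, w_k))


# Hypothetical word vectors, chosen arbitrarily for the check.
w_i = [0.5, -1.0]
w_j = [1.5, 0.25]
w_k = [-0.75, 2.0]

lhs = g(w_i, w_j, w_k)
rhs = math.exp(dot(w_i, w_k)) / math.exp(dot(w_j, w_k))

if __name__ == "__main__":
    print(lhs, rhs)  # the two values agree up to floating-point error
```

Swapping $w_i$ and $w_j$ inverts the value, just as swapping $i$ and $j$ inverts $ratio_{i,j,k}$.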

$$
\frac{P_{i,k}}{P_{j,k}} = g(w_i, w_j, w_k) \\
\frac{P_{i,k}}{P_{j,k}} = \exp((w_i - w_j)^T w_k) \\
\frac{P_{i,k}}{P_{j,k}} = \exp(w_i^T w_k - w_j^T w_k) \\
\frac{P_{i,k}}{P_{j,k}} = \frac{\exp(w_i^T w_k)}{\exp(w_j^T w_k)}
$$
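Matching numerator with numerator and denominator with denominator, it follows that:

$$
P_{i,j} = \exp(w_i^T w_j)
$$

Taking the logarithm of both sides and using $P_{i,j} = X_{i,j} / X_i$:

$$
\log(X_{i,j}) - \log(X_i) = w_i^T w_j
$$

The term $\log(X_i)$ depends only on $i$, so it can be absorbed into a bias $b_i$; a second bias $b_j$ is added so that the expression stays symmetric in $i$ and $j$:

$$
\log(X_{i,j}) = w_i^T w_j + b_i + b_j
$$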
The loss function becomes:
$$
J = \sum_{i,j}^{N} \left( w_i^T w_j + b_i + b_j - \log(X_{i,j}) \right)^2
$$
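As a rough sketch of this loss in code, the function below sums the squared residuals over all word pairs. The names `unweighted_loss`, `W` and `b` are chosen here for illustration, and skipping pairs with a zero count is an assumption made in this sketch because $\log(0)$ is undefined; the formula itself does not say how such pairs are handled.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def unweighted_loss(X, W, b):
    """J = sum over i, j of (w_i^T w_j + b_i + b_j - log X_{i,j})^2."""
    total = 0.0
    for i, row in enumerate(X):
        for j, x in enumerate(row):
            if x <= 0:
                continue  # assumption: skip zero counts, log(0) is undefined
            residual = dot(W[i], W[j]) + b[i] + b[j] - math.log(x)
            total += residual ** 2
    return total


# Hypothetical vectors and biases for a 2-word vocabulary.
W = [[0.1, 0.2], [0.3, -0.1]]
b = [0.5, -0.2]

# Counts constructed so that log X_{i,j} = w_i^T w_j + b_i + b_j exactly.
X_fit = [
    [math.exp(dot(W[i], W[j]) + b[i] + b[j]) for j in range(2)]
    for i in range(2)
]

if __name__ == "__main__":
    print(unweighted_loss(X_fit, W, b))  # essentially zero
```

Because `X_fit` is generated from the same vectors and biases, the loss is zero up to rounding; with real counts it would be minimized by gradient descent over `W` and `b`.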
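This matrix-factorization formulation has one drawback: every word pair is weighted equally.
Following the principle that word pairs that co-occur more frequently should carry more weight, a weighting term is added to the loss function: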
$$
J = \sum_{i,j}^{N} f(X_{i,j}) \left( w_i^T w_j + b_i + b_j - \log(X_{i,j}) \right)^2
$$

$$
f(x) = \begin{cases} (x / x_{max})^{0.75}, & \text{if } x < x_{max} \\ 1, & \text{if } x \ge x_{max} \end{cases}
$$
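The weighting function and the weighted loss can be sketched as follows. The default `x_max=100.0` is an assumption (a commonly used setting; the text above does not fix a value), and the names `weight` and `weighted_loss` are chosen here for illustration. Zero counts are skipped, which agrees with $f(0) = 0$ and avoids evaluating $\log(0)$.

```python
import math


def weight(x, x_max=100.0, alpha=0.75):
    """f(x) = (x / x_max)^alpha if x < x_max, otherwise 1."""
    if x < x_max:
        return (x / x_max) ** alpha
    return 1.0


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def weighted_loss(X, W, b, x_max=100.0):
    """J = sum over i, j of f(X_{i,j}) (w_i^T w_j + b_i + b_j - log X_{i,j})^2."""
    total = 0.0
    for i, row in enumerate(X):
        for j, x in enumerate(row):
            if x <= 0:
                continue  # f(0) = 0, so such pairs contribute nothing
            residual = dot(W[i], W[j]) + b[i] + b[j] - math.log(x)
            total += weight(x, x_max) * residual ** 2
    return total


if __name__ == "__main__":
    for count in (1.0, 10.0, 50.0, 100.0, 1000.0):
        print(count, weight(count))
```

The weight grows with the count and is capped at 1, so very frequent pairs do not dominate the loss while rare pairs are down-weighted.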
