GloVe: Global Vectors for Word Representation


J. Pennington, R. Socher, C. D. Manning, GloVe: Global Vectors for Word Representation, EMNLP (2014)


Abstract

Existing methods for learning vector space representations of words capture fine-grained semantic and syntactic regularities via vector arithmetic, but the origin of these regularities has remained opaque.

The paper analyzes and makes explicit the model properties needed for such regularities to emerge in word vectors, arriving at a global log-bilinear regression model that combines the advantages of global matrix factorization and local context window methods.

The model trains only on the nonzero elements of a word-word co-occurrence matrix, efficiently leveraging statistical information, and produces a vector space with meaningful substructure.

1 Introduction

Semantic vector space models of language represent each word with a real-valued vector.

Evaluating representation quality: most word vector methods rely on the distance or angle between pairs of word vectors as the primary method for evaluating the intrinsic quality of a set of word representations.

Two families of methods learn word vectors: (1) global matrix factorization methods, such as latent semantic analysis (LSA); (2) local context window methods, such as skip-gram.

Global matrix factorization fully leverages statistical information but performs poorly on the word analogy task, indicating a sub-optimal vector space structure. Local context window methods do better on word analogies but poorly utilize the statistics of the corpus, since they train on separate local context windows instead of on global co-occurrence counts.

2 Related Work

Matrix factorization methods: decompose large matrices that capture statistical information about a corpus, using low-rank approximations to generate low-dimensional word representations.

The corpus-statistics matrix takes one of two forms: (1) term-document, where rows correspond to words or terms and columns correspond to documents in the corpus; (2) term-term, where both rows and columns correspond to words, and each entry counts the number of times a given word occurs in the context of another given word.

Shallow window-based methods: learn word representations that aid in making predictions within local context windows, e.g. skip-gram and CBOW (continuous bag-of-words), and the closely-related vector log-bilinear models vLBL and ivLBL.

Skip-gram and ivLBL predict a word's context given the word itself; CBOW and vLBL predict a word given its context.
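The two objectives can be sketched in Python with a toy sentence; the helper names and the window radius below are illustrative, not from the paper:

```python
def skipgram_pairs(tokens, radius=1):
    """(center, context) pairs: predict a word's context given the word itself."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - radius), min(len(tokens), i + radius + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_examples(tokens, radius=1):
    """(context, center) examples: predict a word given its context."""
    examples = []
    for i, center in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - radius), min(len(tokens), i + radius + 1))
                   if j != i]
        examples.append((context, center))
    return examples

sent = ["the", "cat", "sat"]
print(skipgram_pairs(sent))  # [('the','cat'), ('cat','the'), ('cat','sat'), ('sat','cat')]
print(cbow_examples(sent))   # [(['cat'],'the'), (['the','sat'],'cat'), (['cat'],'sat')]
```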

3 The GloVe Model

Statistics of word occurrences in a corpus are the primary source of information available to unsupervised methods for learning word representations. Two core questions arise: (1) how meaning is generated from these statistics; (2) how the resulting word vectors might represent that meaning.

The GloVe model: a word vector model built directly on the global corpus statistics.

Let $\mathbf{X}$ be the matrix of word-word co-occurrence counts, where $X_{ij}$ is the number of times word $j$ occurs in the context of word $i$; let $X_{i} = \sum_{k} X_{ik}$; then $P_{ij} = P(j \mid i) = \frac{X_{ij}}{X_{i}}$ is the probability that word $j$ appears in the context of word $i$.
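A minimal sketch of these definitions, assuming plain counts in a symmetric window over a toy corpus (GloVe actually weights co-occurrences by inverse distance within the window; that refinement is omitted here):

```python
from collections import defaultdict

def cooccurrence(tokens, radius=2):
    """Count X_ij: times word j occurs within `radius` of word i (plain counts)."""
    X = defaultdict(float)
    for i, w in enumerate(tokens):
        for j in range(max(0, i - radius), min(len(tokens), i + radius + 1)):
            if j != i:
                X[(w, tokens[j])] += 1.0
    return X

tokens = "the ice is solid the steam is gas".split()
X = cooccurrence(tokens)

# X_i = sum_k X_ik, the total co-occurrence count of word i
X_i = defaultdict(float)
for (w, _), c in X.items():
    X_i[w] += c

# P(j|i) = X_ij / X_i
P = {(w, c): n / X_i[w] for (w, c), n in X.items()}
```

By construction, the probabilities $P(j \mid i)$ for a fixed $i$ sum to 1 over its contexts $j$.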

Table (1) shows that ratios of co-occurrence probabilities are a better starting point for learning word vectors than the raw probabilities themselves, i.e.

$$F(\mathbf{w}_{i}, \mathbf{w}_{j}, \tilde{\mathbf{w}}_{k}) = \frac{P_{ik}}{P_{jk}} \tag{1}$$

where $\mathbf{w} \in \mathbb{R}^{d}$ are word vectors and $\tilde{\mathbf{w}} \in \mathbb{R}^{d}$ are separate context word vectors.

  1. The function $F$ should encode, in the word vector space, the information present in the ratio $\frac{P_{ik}}{P_{jk}}$. Since vector spaces are inherently linear structures, the most natural way to do this is with vector differences:
    $$F(\mathbf{w}_{i} - \mathbf{w}_{j}, \tilde{\mathbf{w}}_{k}) = \frac{P_{ik}}{P_{jk}} \tag{2}$$

  2. The right-hand side of Eq. (2) is a scalar while the arguments of $F$ are vectors, so $F$ can take the dot product of its arguments:
    $$F\left((\mathbf{w}_{i} - \mathbf{w}_{j})^{\mathrm{T}} \tilde{\mathbf{w}}_{k}\right) = \frac{P_{ik}}{P_{jk}} \tag{3}$$

  3. For word-word co-occurrence matrices, the distinction between a word and a context word is arbitrary; we are free to exchange the two roles, $\mathbf{w} \leftrightarrow \tilde{\mathbf{w}}$ and $\mathbf{X} \leftrightarrow \mathbf{X}^{\mathrm{T}}$, and the model should be invariant under this relabeling. Requiring $F$ to be a homomorphism between the groups $(\mathbb{R}, +)$ and $(\mathbb{R}_{>0}, \times)$ gives
    $$F\left((\mathbf{w}_{i} - \mathbf{w}_{j})^{\mathrm{T}} \tilde{\mathbf{w}}_{k}\right) = \frac{F\left(\mathbf{w}_{i}^{\mathrm{T}} \tilde{\mathbf{w}}_{k}\right)}{F\left(\mathbf{w}_{j}^{\mathrm{T}} \tilde{\mathbf{w}}_{k}\right)} \tag{4}$$
    which, combined with Eq. (3), is solved by
    $$F\left(\mathbf{w}_{i}^{\mathrm{T}} \tilde{\mathbf{w}}_{k}\right) = P_{ik} = \frac{X_{ik}}{X_{i}} \tag{5}$$
    $F = \exp$ satisfies Eq. (4), i.e.
    $$\mathbf{w}_{i}^{\mathrm{T}} \tilde{\mathbf{w}}_{k} = \log(P_{ik}) = \log(X_{ik}) - \log(X_{i}) \tag{6}$$

  4. Eq. (6) breaks the exchange symmetry only through the term $\log(X_{i})$, which is independent of $k$ and can be absorbed into a bias $b_{i}$ for $\mathbf{w}_{i}$; adding a bias $\tilde{b}_{k}$ for $\tilde{\mathbf{w}}_{k}$ restores the symmetry:
    $$\mathbf{w}_{i}^{\mathrm{T}} \tilde{\mathbf{w}}_{k} + b_{i} + \tilde{b}_{k} = \log(X_{ik}) \tag{7}$$

Eq. (7) is ill-defined whenever a count is zero, since the logarithm diverges at zero; one fix is an additive shift inside the logarithm, $\log(X_{ik}) \rightarrow \log(X_{ik} + 1)$.
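A small numeric sketch of Eq. (7) with the shifted logarithm; the vectors, biases, and counts below are random placeholders, not trained values:

```python
import numpy as np

rng = np.random.default_rng(0)
d, V = 5, 3
W, W_tilde = rng.normal(size=(V, d)), rng.normal(size=(V, d))  # word / context vectors
b, b_tilde = rng.normal(size=V), rng.normal(size=V)            # biases b_i, b~_k
X = np.array([[0., 2., 1.],
              [2., 0., 3.],
              [1., 3., 0.]])                                   # toy co-occurrence counts

def score(i, k):
    # Left-hand side of Eq. (7): w_i^T w~_k + b_i + b~_k
    return W[i] @ W_tilde[k] + b[i] + b_tilde[k]

# The log(X + 1) shift keeps the target finite even for zero counts
target = np.log(X + 1.0)
residual = score(0, 1) - target[0, 1]
```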

The main drawback of this model is that it weighs all co-occurrences equally, even though rare co-occurrences are noisy and carry less information than the more frequent ones. The paper therefore proposes a weighted least squares regression model, introducing a weighting function $f(X_{ij})$ into the cost function:

$$\mathcal{J} = \sum_{i, j = 1}^{|V|} f(X_{ij}) \left( \mathbf{w}_{i}^{\mathrm{T}} \tilde{\mathbf{w}}_{j} + b_{i} + \tilde{b}_{j} - \log(X_{ij}) \right)^{2} \tag{8}$$

where $|V|$ is the size of the vocabulary. The weighting function should satisfy:

  1. $f(0) = 0$; viewed as a continuous function, $f$ should vanish fast enough as $x \to 0$ that $\lim_{x \rightarrow 0} f(x) \log^{2} x$ is finite;

  2. $f(x)$ should be non-decreasing, so that rare co-occurrences are not overweighted;

  3. $f(x)$ should be relatively small for large $x$, so that frequent co-occurrences are not overweighted either.

$$f(x) = \begin{cases} \left( \frac{x}{x_{\max}} \right)^{\alpha}, & \text{if } x < x_{\max} \\ 1, & \text{otherwise} \end{cases} \tag{9}$$

In the paper, $x_{\max} = 100$ and $\alpha = 3/4$.
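Eqs. (8) and (9) can be sketched with NumPy; the parameters below are random placeholders, and `glove_loss` is a hypothetical helper name, not GloVe's actual implementation:

```python
import numpy as np

def f(x, x_max=100.0, alpha=0.75):
    """Weighting function of Eq. (9); note f(0) = 0 and f saturates at 1."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(W, W_tilde, b, b_tilde, X):
    """Cost of Eq. (8), summed over nonzero co-occurrences only."""
    mask = X > 0
    logX = np.where(mask, np.log(np.where(mask, X, 1.0)), 0.0)
    pred = W @ W_tilde.T + b[:, None] + b_tilde[None, :]
    return np.sum(f(X) * mask * (pred - logX) ** 2)

rng = np.random.default_rng(1)
V, d = 4, 3
W, Wt = rng.normal(size=(V, d)), rng.normal(size=(V, d))
b, bt = np.zeros(V), np.zeros(V)
X = rng.integers(0, 5, size=(V, V)).astype(float)
loss = glove_loss(W, Wt, b, bt, X)
```

Because $f(0) = 0$, zero entries of $\mathbf{X}$ contribute nothing, which is what lets training touch only the nonzero elements of the co-occurrence matrix.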


3.1 Relationship to Other Models

3.2 Complexity of the Model

4 Experiments

4.1 Evaluation Methods

Word analogies

Word similarity

Named entity recognition (NER)

4.2 Corpora and Training Details

4.3 Results


4.4 Model Analysis: Vector Length and Context Size

Symmetric context window: the target word sits in the middle of the window, with context on both sides.

Asymmetric context window: a context window that extends only to the left of the target word, which sits at its right edge.

Small asymmetric context windows perform better on the syntactic subtask. The reason is that syntactic information is mostly drawn from the immediate context and depends strongly on word order, whereas semantic information is more frequently non-local and requires larger windows to capture.
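The two window types can be illustrated with a short sketch (hypothetical helper names):

```python
def symmetric_window(tokens, i, size):
    """Context on both sides of the target word at index i."""
    left = tokens[max(0, i - size):i]
    right = tokens[i + 1:i + 1 + size]
    return left + right

def asymmetric_window(tokens, i, size):
    """Context extending only to the left of the target word."""
    return tokens[max(0, i - size):i]

toks = ["a", "b", "c", "d", "e"]
print(symmetric_window(toks, 2, 2))   # ['a', 'b', 'd', 'e']
print(asymmetric_window(toks, 2, 2))  # ['a', 'b']
```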

4.5 Model Analysis: Corpus Size

On the syntactic task, performance increases monotonically with corpus size, since a larger corpus yields better statistics.

On the semantic task, performance is not strongly tied to corpus size and depends more on corpus quality: the analogy dataset contains a large number of city- and country-based analogies, and Wikipedia has fairly comprehensive articles for most such locations.

4.6 Model Analysis: Run-time


4.7 Model Analysis: Comparison with word2vec

5 Conclusions

Acknowledgments
