How to find the meaningful word to represent each k-means cluster derived from word2vec vectors?

【Question title】: How to find the meaningful word to represent each k-means cluster derived from word2vec vectors?
【Posted】: 2017-12-05 21:43:41
【Question】:

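I am using the gensim package in Python to load the pretrained Google word2vec dataset. I then want to run k-means on my word vectors to find meaningful clusters, and to find a representative word for each cluster. I am thinking of representing each cluster by the word whose vector is closest to the cluster's centroid, but I don't know whether this is a good idea, since my experiment did not give good results.

My sample code is as follows: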

import gensim
import numpy as np
import pandas as pd
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import pairwise_distances_argmin_min

model = gensim.models.KeyedVectors.load_word2vec_format('/home/Desktop/GoogleNews-vectors-negative300.bin', binary=True)  

K=3

words = ["ship", "car", "truck", "bus", "vehicle", "bike", "tractor", "boat",
       "apple", "banana", "fruit", "pear", "orange", "pineapple", "watermelon",
       "dog", "pig", "animal", "cat", "monkey", "snake", "tiger", "rat", "duck", "rabbit", "fox"]
NumOfWords = len(words)

# construct the n-dimensional array for input data, each row is a word vector
x = np.zeros((NumOfWords, model.vector_size))
for i in range(0, NumOfWords):
    x[i,]=model[words[i]] 

# train the k-means model
classifier = MiniBatchKMeans(n_clusters=K, random_state=1, max_iter=100)
classifier.fit(x)

# check whether the words are clustered correctly
print(classifier.predict(x))

# find the index and the distance of the closest points from x to each class centroid
close = pairwise_distances_argmin_min(classifier.cluster_centers_, x, metric='euclidean')
index_closest_points = close[0]
distance_closest_points = close[1]

for i in range(0, K):
    print("The closest word to the centroid of class {0} is {1}, the distance is {2}".format(i, words[index_closest_points[i]], distance_closest_points[i]))

The output is as follows:

[2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0]
The closest word to the centroid of class 0 is rabbit, the distance is 1.578625818679259
The closest word to the centroid of class 1 is fruit, the distance is 1.8351978219013796
The closest word to the centroid of class 2 is car, the distance is 1.6586030662247868

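In the code I have 3 categories of words: vehicles, fruits and animals. From the output we can see that k-means clustered the words of all 3 categories correctly, but the representative words derived with the centroid method are not very good: for class 0 I would like to see "animal" but it gives "rabbit", and for class 2 I would like to see "vehicle" but it returns "car".

Any help or suggestions on finding a good representative word for each cluster would be highly appreciated.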

【Question discussion】:

Tags: python k-means gensim word2vec

【Answer 1】:

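It sounds like you're hoping to find, with an automated process, a generic term for the words in a cluster (something like a hypernym), and were hoping that the word nearest the centroid would be that term.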

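Unfortunately, I haven't seen any claims that word2vec ends up arranging words that way. Words do tend to be close to other words that could fill in for them, but there is no real guarantee that all words of a shared type are closer to each other than to words of other types, or that hypernyms tend to be equidistant from their hyponyms, and so forth. (Given word2vec's success at analogy-solving, it is certainly possible that hypernyms tend to be offset from their hyponyms in a vaguely similar direction across classes. That is, perhaps vaguely 'volkswagen' + ('animal' - 'dog') ~ 'car', though I haven't checked.)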
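
The offset idea in the parenthesis above can be sketched as follows. This is a minimal illustration only: the 3-d `toy_vectors` are invented by hand so that the arithmetic works out, and `hypernym_by_offset` is a hypothetical helper, not a gensim function. Whether real word2vec vectors behave this way is exactly the unchecked part.

```python
import numpy as np

def unit(v):
    """Scale a vector to length 1 so only its direction matters."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

def hypernym_by_offset(vectors, hyponym, known_hypo, known_hyper):
    """Guess a hypernym for `hyponym` by reusing the offset of a known pair.

    Builds hyponym + (known_hyper - known_hypo) from unit vectors and returns
    the remaining vocabulary word with the highest cosine similarity to it.
    """
    target = unit(unit(vectors[hyponym])
                  + unit(vectors[known_hyper])
                  - unit(vectors[known_hypo]))
    skip = {hyponym, known_hypo, known_hyper}  # never return an input word
    scores = {w: float(unit(v) @ target)
              for w, v in vectors.items() if w not in skip}
    return max(scores, key=scores.get)

# Hand-made toy vectors (an assumption for illustration, NOT real word2vec):
# axis 0 ~ "animal-ness", axis 1 ~ "vehicle-ness", axis 2 ~ "generality".
toy_vectors = {
    "dog":        [1.0, 0.0, 0.0],
    "cat":        [1.0, 0.0, 0.1],
    "animal":     [1.0, 0.0, 1.0],
    "volkswagen": [0.0, 1.0, 0.0],
    "truck":      [0.0, 1.0, 0.2],
    "car":        [0.0, 1.0, 1.0],
}

result = hypernym_by_offset(toy_vectors, "volkswagen", "dog", "animal")
print(result)
```

With the loaded `KeyedVectors` from the question, the same query can be tried as `model.most_similar(positive=['volkswagen', 'animal'], negative=['dog'])`; the result on real vectors is not something this sketch predicts.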

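There's an interesting observation sometimes made about word vectors that could be relevant: the vectors for words with more diffuse meanings (such as words with multiple senses) often have lower magnitude, in their raw form, than the vectors for words with more singular meanings. The usual most-similar calculations ignore magnitude and compare only raw directions, but a search for more generic terms might want to favor lower-magnitude vectors. This, too, is just a guess I haven't checked.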
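
One way to try that guess is to rank a cluster's members by cosine similarity to the centroid, minus a penalty for above-average vector length. The sketch below is an assumption-laden illustration: `rank_representatives`, the `norm_weight` parameter, and the particular scoring formula are all made up here, and the 2-d vectors are toy data chosen so that the short vector points slightly away from the centroid.

```python
import numpy as np

def rank_representatives(words, vectors, centroid, norm_weight=1.0):
    """Rank cluster members as candidate representative words.

    score = cosine(vector, centroid) - norm_weight * (norm / mean_norm - 1)
    so vectors shorter than the cluster average get a bonus.
    The formula is an illustrative assumption, not an established method.
    """
    vectors = np.asarray(vectors, dtype=float)
    centroid = np.asarray(centroid, dtype=float)
    norms = np.linalg.norm(vectors, axis=1)
    cosine = (vectors @ centroid) / (norms * np.linalg.norm(centroid))
    score = cosine - norm_weight * (norms / norms.mean() - 1.0)
    order = np.argsort(-score)  # highest score first
    return [(words[i], float(score[i])) for i in order]

# Toy raw (unnormalized) vectors, invented for illustration only.
toy_words = ["animal", "rabbit", "dog"]
toy_raw = np.array([[1.0, 0.6],
                    [2.0, 2.0],
                    [2.0, 2.6]])
toy_centroid = toy_raw.mean(axis=0)

by_direction = rank_representatives(toy_words, toy_raw, toy_centroid, norm_weight=0.0)
by_magnitude = rank_representatives(toy_words, toy_raw, toy_centroid, norm_weight=1.0)
print(by_direction[0][0], by_magnitude[0][0])
```

On this toy data, direction alone picks "rabbit" while the magnitude-aware score picks "animal". On the question's data, `toy_raw` would be replaced by the rows of `x` for one cluster and `toy_centroid` by the matching row of `classifier.cluster_centers_`; whether it helps there is untested.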

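You could look up work on automated hypernym/hyponym discovery. It's possible that word2vec vectors could be a contributing factor in such discovery processes, either trained in the normal way or with some new wrinkles that try to force the desired arrangement. (But such specializations aren't usually supported by gensim out of the box.)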

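There are often papers refining the word2vec training process to make the vectors better for particular purposes. A recent paper from Facebook Research that seems relevant is "Poincaré Embeddings for Learning Hierarchical Representations", which reports better modeling of hierarchies and specifically tests on the noun hypernym graph of WordNet.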
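
For orientation, the distance used in the Poincaré ball model is d(u, v) = arcosh(1 + 2·|u - v|² / ((1 - |u|²)(1 - |v|²))). The sketch below only evaluates that formula with numpy; it does not train anything. The 2-d `toy_points` are invented, and `most_general` relies on the assumption (as the paper describes for its learned embeddings) that more general terms end up nearer the origin.

```python
import numpy as np

def poincare_distance(u, v):
    """Distance between two points strictly inside the unit ball."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    sq_diff = np.sum((u - v) ** 2)
    denom = (1.0 - np.sum(u ** 2)) * (1.0 - np.sum(v ** 2))
    return float(np.arccosh(1.0 + 2.0 * sq_diff / denom))

def most_general(points):
    """Pick the word closest to the origin (assumed to be the most generic)."""
    return min(points, key=lambda w: np.linalg.norm(points[w]))

# Hypothetical positions, NOT the output of a trained model.
toy_points = {
    "animal": [0.1, 0.0],
    "dog":    [0.7, 0.1],
    "cat":    [0.7, -0.1],
}

print(most_general(toy_points))
print(poincare_distance(toy_points["dog"], toy_points["cat"]))
```

If embeddings of this kind were available for the clustered words, picking the member nearest the origin would be one candidate replacement for the nearest-to-centroid rule in the question.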

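【Discussion】:

Thanks a lot for your reply, gojomo. I will check the literature on automated hypernym/hyponym discovery. Do you know of any R or Python packages that can do hypernym/hyponym discovery?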