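【Posted】: 2017-12-05 21:43:41
【Question】:
I am using the gensim package in Python to load the pre-trained Google word2vec dataset. I then want to run k-means on my word vectors to find meaningful clusters, and to find a representative word for each cluster. My idea is to represent each cluster by the word whose vector is closest to the cluster centroid, but I am not sure this is a good idea, because my experiment did not give very good results.
My sample code is as follows: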
import gensim
import numpy as np
import pandas as pd
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import pairwise_distances_argmin_min
model = gensim.models.KeyedVectors.load_word2vec_format('/home/Desktop/GoogleNews-vectors-negative300.bin', binary=True)
K=3
words = ["ship", "car", "truck", "bus", "vehicle", "bike", "tractor", "boat",
         "apple", "banana", "fruit", "pear", "orange", "pineapple", "watermelon",
         "dog", "pig", "animal", "cat", "monkey", "snake", "tiger", "rat", "duck", "rabbit", "fox"]
NumOfWords = len(words)
# construct the n-dimensional array for input data, each row is a word vector
x = np.zeros((NumOfWords, model.vector_size))
for i in range(0, NumOfWords):
    x[i] = model[words[i]]
# train the k-means model
classifier = MiniBatchKMeans(n_clusters=K, random_state=1, max_iter=100)
classifier.fit(x)
# check whether the words are clustered correctly
print(classifier.predict(x))
# find the index and the distance of the closest points from x to each class centroid
close = pairwise_distances_argmin_min(classifier.cluster_centers_, x, metric='euclidean')
index_closest_points = close[0]
distance_closest_points = close[1]
for i in range(0, K):
    print("The closest word to the centroid of class {0} is {1}, the distance is {2}".format(i, words[index_closest_points[i]], distance_closest_points[i]))
The output is:
[2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0]
The closest word to the centroid of class 0 is rabbit, the distance is 1.578625818679259
The closest word to the centroid of class 1 is fruit, the distance is 1.8351978219013796
The closest word to the centroid of class 2 is car, the distance is 1.6586030662247868
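The code uses 3 categories of words: vehicles, fruits, and animals. The output shows that k-means clustered the words of all 3 categories correctly, but the representative words obtained with the centroid method are not very good: for class 0 I would like to see "animal" but it gives "rabbit", and for class 2 I would like to see "vehicle" but it returns "car".
Any help or suggestions on finding a good representative word for each cluster would be greatly appreciated.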
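For reference, the nearest-to-centroid selection described above can be isolated into a small function and tried without the GoogleNews file. The following is only a minimal sketch under stated assumptions: `representative_words` is a hypothetical helper name, the vectors are synthetic stand-ins rather than real word2vec embeddings, plain `KMeans` is swapped in for `MiniBatchKMeans`, and the cosine metric is an assumed alternative to the Euclidean metric used in the question. Nothing here shows that it would return "animal" or "vehicle" on the real data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min


def representative_words(words, vectors, n_clusters, metric="cosine", seed=1):
    """Hypothetical helper: cluster the vectors and, for each cluster, return
    the member word whose vector is closest to that cluster's centroid."""
    kmeans = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10)
    labels = kmeans.fit_predict(vectors)
    representatives = {}
    for cluster_id in range(n_clusters):
        member_idx = np.flatnonzero(labels == cluster_id)
        # Search only among this cluster's own members.
        nearest, dist = pairwise_distances_argmin_min(
            kmeans.cluster_centers_[cluster_id].reshape(1, -1),
            vectors[member_idx],
            metric=metric,
        )
        word = words[member_idx[nearest[0]]]
        representatives[cluster_id] = (word, float(dist[0]))
    return labels, representatives


# Synthetic stand-in for word vectors (assumption): three well-separated
# groups of four points each in 5 dimensions.
rng = np.random.default_rng(0)
toy_words = [f"w{i}" for i in range(12)]
centers = np.eye(5)[:3] * 10.0
toy_vectors = np.vstack([c + rng.normal(scale=0.5, size=(4, 5)) for c in centers])

labels, reps = representative_words(toy_words, toy_vectors, n_clusters=3)
for cluster_id, (word, dist) in sorted(reps.items()):
    print(f"cluster {cluster_id}: representative={word}, cosine distance={dist:.4f}")
```

With the real model, `vectors` would be the matrix `x` and `words` the word list from the question. Passing `metric="euclidean"` should correspond to the original selection rule, apart from the change of clustering class.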
【Discussion】:
Tags: python k-means gensim word2vec