[Posted]: 2014-11-07 21:02:14
[Question]:
I am clustering various articles with scikit-learn. Here are the top 15 words in each cluster:
Cluster 0: whales islands seaworld hurricane whale odile storm tropical kph mph pacific mexico orca coast cabos
Cluster 1: ebola outbreak vaccine africa usaid foundation virus cdc gates disease health vaccines experimental centers obama
Cluster 2: jones bobo sanford children carolina mississippi alabama lexington bodies crumpton mccarty county hyder tennessee sheriff
Cluster 3: isis obama iraq syria president isil airstrikes islamic li strategy terror military war threat al
Cluster 4: yosemite wildfire park evacuation dome firefighters blaze hikers cobb helicopter backcountry trails homes california evacuate
I create the "bag of words" matrix like this:
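from sklearn.feature_extraction.text import TfidfTransformer, TfidfVectorizer
from sklearn.pipeline import make_pipeline

hasher = TfidfVectorizer(max_df=0.5,
                         min_df=2, stop_words='english',
                         use_idf=True)
# Note: TfidfVectorizer already applies TF-IDF weighting, so the extra
# TfidfTransformer step re-weights its output a second time.
vectorizer = make_pipeline(hasher, TfidfTransformer())
# document_text_list is a list of all text in a given article
X_train_tfidf = vectorizer.fit_transform(document_text_list)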
Then I run KMeans like this:
import sklearn.cluster

km = sklearn.cluster.KMeans(init='k-means++', max_iter=10000, n_init=1,
                            verbose=0, n_clusters=25)
km.fit(X_train_tfidf)
I print out the clusters like this:
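from __future__ import print_function  # needed on Python 2 for print(..., end='')

print("Top terms per cluster:")
# Sort each centroid's term weights in descending order
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
# get_feature_names() was replaced by get_feature_names_out() in scikit-learn 1.0+
terms = hasher.get_feature_names()
for i in range(25):
    print("Cluster %d:" % i, end='')
    for ind in order_centroids[i, :15]:
        print(' %s' % terms[ind], end='')
    print()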
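However, I would like to know how to determine which documents belong to the same cluster and, ideally, each document's distance to the centroid of its cluster.
I know that each row of the resulting matrix (X_train_tfidf) corresponds to one document, but there is no obvious way to get this information back after running the KMeans algorithm. How would I do this with scikit-learn?
X_train_tfidf looks like this: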
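To make the goal concrete, here is a minimal sketch on toy data of the output I am after. The four documents and the choice of two clusters are made up for illustration. It relies on `km.labels_` (the cluster index for each row) and `km.transform` (distances from each row to every centroid); I have not checked that this is the recommended approach for my real matrix.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in for document_text_list (hypothetical data).
docs = [
    "whale orca whale seaworld",
    "orca whale pacific coast",
    "ebola virus outbreak vaccine",
    "ebola vaccine virus cdc",
]

X = TfidfVectorizer().fit_transform(docs)
km = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0).fit(X)

# labels_[i] is the cluster index assigned to row i (document i) of X.
labels = km.labels_

# transform() gives the distance from every row to every centroid;
# select, for each row, the column of its own cluster.
dist_to_all = km.transform(X)  # shape (n_docs, n_clusters)
dist_to_own = dist_to_all[np.arange(X.shape[0]), labels]

for cluster_id in range(km.n_clusters):
    members = np.where(labels == cluster_id)[0]  # original row indices
    for row in members:
        print(cluster_id, row, round(float(dist_to_own[row]), 3), docs[row])
```

Because the row order of the matrix matches the order of the input list, `members` can be used directly to look the documents up again.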
X_train_tfidf: (0, 4661) 0.0405014425985
(0, 19271) 0.0914545222775
(0, 20393) 0.287636818634
(0, 56027) 0.116893929188
(0, 30872) 0.137815327338
(0, 35256) 0.0343461345507
(0, 31291) 0.209804679792
(0, 66008) 0.0643776635222
(0, 3806) 0.0967713285061
(0, 66338) 0.0532881852791
(0, 65023) 0.0702918299573
(0, 41785) 0.197672720592
(0, 29774) 0.120772893833
(0, 61409) 0.0268609667042
(0, 55527) 0.134102682463
(0, 40011) 0.0582437010271
(0, 19667) 0.0234843097048
(0, 51667) 0.128270976476
(0, 52791) 0.57198926651
(0, 15014) 0.149195054799
(0, 18805) 0.0277497826525
(0, 35939) 0.170775938672
(0, 5808) 0.0473913910636
(0, 24922) 0.0126531527875
(0, 10346) 0.0200098997901
: :
(23945, 56927) 0.0595132327966
(23945, 23259) 0.0100977769025
(23945, 12515) 0.0482102583442
(23945, 49709) 0.210139450446
(23945, 28742) 0.0190221880312
(23945, 16628) 0.137692798005
(23945, 53424) 0.157029848335
(23945, 30647) 0.104485375827
(23945, 57512) 0.0569754813269
(23945, 39389) 0.0158180459761
(23945, 26093) 0.0153713768922
(23945, 9787) 0.0963777149738
(23945, 23260) 0.158336452835
(23945, 50595) 0.0527243936945
(23945, 42447) 0.0527515904547
(23945, 2829) 0.0351677269698
(23945, 2832) 0.0175929392039
(23945, 52079) 0.0849796887889
(23945, 13523) 0.0878730969786
(23945, 57849) 0.133869666381
(23945, 25064) 0.128424780903
(23945, 31129) 0.0919760384953
(23945, 65601) 0.0388718258746
(23945, 1428) 0.391477289626
(23945, 2152) 0.655211469073
X_train_tfidf shape: (23946, 67816)
In response to ttttthomassss's answer:
When I try to run the following:
X_cluster_0 = X_train_tfidf[cluster_0]
I get this error:
  File "cluster.py", line 52, in main
    X_cluster_0 = X_train_tfidf[cluster_0]
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/scipy/sparse/csr.py", line 226, in __getitem__
    col = key[1]
IndexError: tuple index out of range
Looking at the structure of cluster_0:
(array([ 858, 2012, 2256, 2762, 2920, 3770, 6052, 6174, 8296,
9494, 9966, 10085, 11914, 12117, 12633, 12727, 12993, 13527,
13754, 14186, 14669, 14713, 14973, 15071, 15157, 15208, 15926,
16300, 16301, 17138, 17556, 17775, 18236, 19057, 20106, 21014, 21080]),)
This is a tuple with the contents at position 0, so I changed the line to the following:
X_cluster_0 = X_train_tfidf[cluster_0[0]]
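I am pulling the "documents" from a database, so I can easily recover the index (by iterating over the supplied array until I find the corresponding document, assuming of course that scikit-learn does not change the order of the documents in the matrix). What I don't understand is what exactly X_cluster_0 represents. X_cluster_0 has the following structure: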
X_cluster_0: (0, 42726) 0.741747456202
(0, 13535) 0.115880661286
(0, 17447) 0.117608794277
(0, 44849) 0.414829246262
(0, 14574) 0.10214258736
(0, 17317) 0.0634383214735
(0, 17935) 0.0591234431875
: :
(17, 33867) 0.0174155914371
(17, 48916) 0.0227046046275
(17, 59132) 0.0168864861723
(17, 40860) 0.0485813219503
(17, 63725) 0.0271415763987
(18, 45019) 0.490135684209
(18, 36168) 0.14595160766
(18, 52304) 0.139590524213
(18, 63586) 0.16501953796
(18, 28709) 0.15075416279
(18, 11495) 0.0926490431993
(18, 40860) 0.124236878928
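My current understanding of that printout, sketched on a small made-up matrix: I am assuming cluster_0 came from something like np.where(km.labels_ == 0), which returns a 1-tuple of index arrays, and that indexing the sparse matrix with that array yields a sub-matrix whose rows are renumbered from 0. The matrix values and labels below are hypothetical.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical 5-document, 3-term matrix standing in for X_train_tfidf.
X = csr_matrix(np.array([
    [0.1, 0.0, 0.9],
    [0.0, 0.5, 0.0],
    [0.7, 0.0, 0.3],
    [0.0, 0.2, 0.8],
    [0.4, 0.6, 0.0],
]))
labels = np.array([0, 1, 0, 1, 0])  # stand-in for km.labels_

cluster_0 = np.where(labels == 0)   # a 1-tuple holding one index array
rows = cluster_0[0]                 # the array of original row indices
X_cluster_0 = X[rows]               # sub-matrix; rows renumbered 0..len(rows)-1

# Row i of X_cluster_0 is row rows[i] of the original matrix.
for i, original_row in enumerate(rows):
    same = (X_cluster_0[i].toarray() == X[original_row].toarray()).all()
    print(i, "->", original_row, same)
```

If that reading is right, the first coordinate in the X_cluster_0 printout is a position within the cluster, and rows[i] maps it back to the document's row in X_train_tfidf.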
Calculating the distance to the centroid
Running the suggested code (distance = euclidean(X_cluster_0[0], km.cluster_centers_[0])) currently results in the following error:
  File "cluster.py", line 68, in main
    distance = euclidean(X_cluster_0[0], km.cluster_centers_[0])
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/scipy/spatial/distance.py", line 211, in euclidean
    dist = norm(u - v)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/scipy/sparse/compressed.py", line 197, in __sub__
    raise NotImplementedError('adding a nonzero scalar to a '
NotImplementedError: adding a nonzero scalar to a sparse matrix is not supported
This is what km.cluster_centers_ looks like:
km.cluster_centers: [ 9.47080802e-05 2.53907413e-03 0.00000000e+00 ..., 0.00000000e+00
0.00000000e+00 0.00000000e+00]
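I think the problem I have now is how to extract the i-th item of the matrix (assuming the matrix is traversed from left to right). No level of index nesting that I specify makes any difference: X_cluster_0[0], X_cluster_0[0][0], and X_cluster_0[0][0][0] all give me the same printed matrix structure shown above.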
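A small reproduction of the indexing behaviour, with two workarounds that look plausible to me. The matrix and centroid values are made up. Option 1 converts one sparse row to a dense 1-D array before calling scipy's euclidean; option 2 uses sklearn.metrics.pairwise.euclidean_distances, which accepts sparse input. I am not sure which is preferable at the scale of my real data.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.spatial.distance import euclidean
from sklearn.metrics.pairwise import euclidean_distances

# Hypothetical stand-ins for X_cluster_0 and km.cluster_centers_[0].
X_cluster_0 = csr_matrix(np.array([
    [0.0, 0.6, 0.8],
    [1.0, 0.0, 0.0],
]))
centroid = np.array([0.0, 0.0, 0.8])

# Indexing a sparse matrix with [0] returns a 1 x n sparse matrix, not a
# 1-D vector, so applying [0] again just selects the same single row.
first_row = X_cluster_0[0]
print(first_row.shape, first_row[0].shape)

# Option 1: densify one row so euclidean() subtracts two plain 1-D arrays.
d_dense = euclidean(first_row.toarray().ravel(), centroid)

# Option 2: distances for all rows at once; the centroid must be 2-D.
d_all = euclidean_distances(X_cluster_0, centroid.reshape(1, -1)).ravel()

print(d_dense, d_all)
```

Densifying a single row stays cheap even with tens of thousands of features, whereas calling toarray() on the whole matrix would not be.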
[Discussion]:
Tags: python artificial-intelligence scikit-learn k-means