python - How do I extract the id from an unsupervised text classification

【Question title】: python - How do I extract the id from an unsupervised text classification
【Posted】: 2019-04-27 14:08:12
【Question】:

So I have the following dataframe:

id     text
342    text sample
341    another text sample
343    ...

And the following code:

X = tfidf_vectorizer.fit_transform(df['text']).todense()
pca = PCA(n_components=2)
data2D = pca.fit_transform(X)
clusterer = KMeans(n_clusters=n_clusters, random_state=10)
cluster_labels = clusterer.fit_predict(data2D)
silhouette_avg = silhouette_score(data2D, cluster_labels)
print(silhouette_avg)
y_lower = 10
for i in range(n_clusters):
    # here I would like to get the id's of each item per cluster
    # so that I know which list of id's falls into which cluster

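Now, how can I see which id belongs to which cluster? Is that even possible? And is my approach the right way to "cluster" these text documents?

Please note that I may have skipped some code to keep the question short.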

【Comments】:

Tags: python-3.x k-means pca text-classification unsupervised-learning

【Solution 1】:

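There are many ways to perform document classification. K-Means is one of them. Without looking at your data and use case, and without exploring other methods, it is impossible to say whether what you are doing is the best approach.

If you want to stick with KMeans, I suggest re-reading the documentation on the scikit-learn website. You will notice in the examples that you get the predicted cluster label for each point from the labels_ attribute of the fitted estimator (note: not from the result of fit_transform that you currently have).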
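As a minimal sketch of that idea: labels_ holds one cluster label per row, in the same order as the rows that were fitted, so it can be written back onto the dataframe and grouped by cluster to list the ids. The toy texts, the value of n_clusters, the n_init setting, and the names cluster and ids_per_cluster are assumptions made for illustration and are not taken from the question. The sketch also uses toarray() in place of todense(), so that PCA receives a plain NumPy array.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data standing in for the dataframe in the question.
df = pd.DataFrame({
    "id": [342, 341, 343, 344],
    "text": [
        "cats and dogs are pets",
        "dogs and cats make good pets",
        "stocks and bonds are investments",
        "bonds and stocks make good investments",
    ],
})
n_clusters = 2  # assumed value; the question does not show it

# Same pipeline as in the question: TF-IDF, then PCA down to 2 dimensions.
X = TfidfVectorizer().fit_transform(df["text"]).toarray()
data2D = PCA(n_components=2).fit_transform(X)

clusterer = KMeans(n_clusters=n_clusters, n_init=10, random_state=10)
clusterer.fit(data2D)

# labels_ has one entry per row of data2D, in the same order as df,
# so it can be attached to the dataframe as a new column.
df["cluster"] = clusterer.labels_

# Collect the ids that fall into each cluster.
ids_per_cluster = df.groupby("cluster")["id"].apply(list).to_dict()
for i in range(n_clusters):
    print(i, ids_per_cluster.get(i, []))
```

For KMeans, the array returned by fit_predict should carry the same labels as labels_, so the cluster_labels variable in the question could be assigned to the dataframe in the same way. This relies on the row order of df being unchanged between vectorizing and clustering.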

【Discussion】: