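【Posted】: 2018-09-20 19:11:11
【Question】:
I have a list of documents, and I want to know how similar each one is to a given document. I have just worked out how to cluster the tokenized documents, but I don't know how to measure their distance from a target document.
Here is how I implemented the clustering. I start with a list of documents...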
text = [
    "This is a test",
    "This is something else",
    "This is also a test"
]
Then I tokenize them with the following function...
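from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

def word_tokenizer(sentences):
    # Split into tokens, drop English stop words, then stem what remains
    tokens = word_tokenize(sentences)
    stemmer = PorterStemmer()
    tokens = [stemmer.stem(t) for t in tokens if t not in stopwords.words('english')]
    return tokens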
I pass this function to TfidfVectorizer...
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vect = TfidfVectorizer(
    tokenizer=word_tokenizer,
    max_df=0.9,
    min_df=0.1,
    lowercase=True
)
tfidf_matrix = tfidf_vect.fit_transform(text)
Then I cluster the matrix with KMeans...
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=3)
kmeans.fit(tfidf_matrix)
Then I collect the documents in each cluster and print the result...
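from collections import defaultdict

# Map each cluster label to the indices of the documents assigned to it
clusters = defaultdict(list)
for i, label in enumerate(kmeans.labels_):
    clusters[label].append(i)
res = dict(clusters)
for cluster in range(3):
    print("cluster ", cluster, ":")
    for i, sentence in enumerate(res[cluster]):
        print("\tsentence ", i, ": ", text[sentence])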
The result looks like this...
cluster 0 :
    sentence 0 : This is also a test
cluster 1 :
    sentence 0 : This is something else
cluster 2 :
    sentence 0 : This is a test
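This is useful information, but suppose I have a target document and want to see how similar each of these documents is to it. How would I do that?
For example, say I have the following target...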
target = ["This is target"]
How can I see how similar each document in text is to this target?
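One possible approach, offered only as a minimal sketch and not as a confirmed answer from this thread: project the target into the same TF-IDF space with the already fitted vectorizer, then compare it to every row of the matrix using cosine similarity. The sketch assumes scikit-learn's default tokenizer instead of the custom word_tokenizer above, so that it runs without NLTK data, and the names target_vec, similarities and order are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

text = [
    "This is a test",
    "This is something else",
    "This is also a test",
]
target = ["This is target"]

# Assumption: default tokenizer for brevity. The word_tokenizer function
# from the question could be passed via tokenizer= instead.
tfidf_vect = TfidfVectorizer(lowercase=True)
tfidf_matrix = tfidf_vect.fit_transform(text)

# Use transform (not fit_transform) so the target is encoded with the
# vocabulary and IDF weights learned from `text`.
target_vec = tfidf_vect.transform(target)

# cosine_similarity returns an array of shape (1, n_documents);
# take row 0 to get one score per document.
similarities = cosine_similarity(target_vec, tfidf_matrix)[0]

# Print documents from most to least similar to the target.
order = similarities.argsort()[::-1]
for idx in order:
    print(f"{similarities[idx]:.3f}  {text[idx]}")
```

Words in the target that never occurred in text (here "target") are not in the fitted vocabulary, so transform ignores them. If stop-word removal or the max_df cutoff strips every remaining word, the target vector is all zeros and every similarity comes out as 0.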
【Discussion】:
Tags: python-3.x machine-learning scikit-learn cluster-analysis unsupervised-learning