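【Posted】: 2019-05-22 10:25:51
【Question】:
I need to cluster sentences based on the common n-grams they contain. I can easily extract n-grams with nltk, but I don't know how to cluster based on n-gram overlap. That is why I couldn't write much real code, and I apologize for that up front. I wrote 6 simple sentences and the expected output to illustrate the problem.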
import nltk
from nltk.util import ngrams

sentences = """I would like to eat pizza with her.
She would like to eat pizza with olive.
There are some sentences must be clustered.
These sentences must be clustered according to common trigrams.
The quick brown fox jumps over the lazy dog.
Apples are red, bananas are yellow."""

# Split the text into sentences with the pre-trained Punkt model
# (the NLTK "punkt" data must be downloaded first).
sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
sentence_tokens = sent_detector.tokenize(sentences.strip())

# Collect the word trigrams of each sentence.
mytrigrams = []
for sentence in sentence_tokens:
    trigrams = ngrams(sentence.lower().split(), 3)
    mytrigrams.append(list(trigrams))
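After this point I don't know how to cluster them by common trigrams (I'm not even sure this part is right). I tried itertools combinations but got lost: I don't know how to produce multiple clusters, because the number of clusters can't be known without comparing every sentence with every other one. The expected output is below. Thanks in advance for your help.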
Cluster1: 'I would like to eat pizza with her.'
'She would like to eat pizza with olive.'
Cluster2: 'There are some sentences must be clustered.'
'These sentences must be clustered according to common trigrams.'
Sentences do not belong to any cluster:
'The quick brown fox jumps over the lazy dog.'
'Apples are red, bananas are yellow.'
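One way to get groups like these without knowing the number of clusters in advance is to treat each sentence as a node, link two sentences whenever they share at least one trigram, and take the connected components. The following is only a sketch under that assumption, not a definitive method: it uses a small union-find structure, builds trigrams in plain Python instead of nltk, and the helper names `ngram_set` and `cluster_by_shared_ngrams` are made up for illustration.

```python
from itertools import combinations


def ngram_set(sentence, n=3):
    """Return the set of word n-grams of a sentence (lowercased, split on whitespace)."""
    words = sentence.lower().split()
    return set(zip(*(words[i:] for i in range(n))))


def cluster_by_shared_ngrams(sentences, n=3):
    """Group sentences linked, directly or transitively, by at least one shared n-gram."""
    grams = [ngram_set(s, n) for s in sentences]
    parent = list(range(len(sentences)))  # union-find forest, one node per sentence

    def find(i):
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge the groups of every pair of sentences that share an n-gram.
    for i, j in combinations(range(len(sentences)), 2):
        if grams[i] & grams[j]:
            parent[find(i)] = find(j)

    # Collect the sentences under their root, keeping the input order.
    groups = {}
    for i, s in enumerate(sentences):
        groups.setdefault(find(i), []).append(s)

    clusters = [g for g in groups.values() if len(g) > 1]
    unclustered = [g[0] for g in groups.values() if len(g) == 1]
    return clusters, unclustered


sentences = [
    "I would like to eat pizza with her.",
    "She would like to eat pizza with olive.",
    "There are some sentences must be clustered.",
    "These sentences must be clustered according to common trigrams.",
    "The quick brown fox jumps over the lazy dog.",
    "Apples are red, bananas are yellow.",
]

clusters, unclustered = cluster_by_shared_ngrams(sentences)
for k, group in enumerate(clusters, 1):
    print(f"Cluster{k}: {group}")
print("Sentences do not belong to any cluster:", unclustered)
```

Note that a single shared trigram is enough to join two sentences here, and membership is transitive: if A overlaps B and B overlaps C, all three end up in one cluster even when A and C share nothing. A stricter rule (for example a minimum number of shared trigrams) would need a different linking condition.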
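Edit: I made another attempt with combinations, but it did not cluster anything; it just returned every sentence pair. (Clearly I am doing something wrong.)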
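from itertools import combinations

# Map each sentence to its list of trigrams.
new_dict = {k: v for k, v in zip(sentence_tokens, mytrigrams)}
common = []
no_cluster = []
# combinations() returns a one-shot iterator over sentence pairs.
sentence_pairs = combinations(new_dict.keys(), 2)
for keys, values in new_dict.items():
    # This inner loop exhausts sentence_pairs during the first outer pass.
    for values in sentence_pairs:
        sentence1 = values[0]
        sentence2 = values[1]
        #print(sentence1, sentence2)
        # sentence1 and sentence2 are strings, so set() compares characters, not trigrams.
        if len(set(sentence1) & set(sentence2)) != 0:
            common.append((sentence1, sentence2))
        else:
            no_cluster.append((sentence1, sentence2))
print(common)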
But even if this code worked, it would not give the output I expect, because I don't know how to generate multiple clusters based on common n-grams.
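For reference, the pairwise check itself can be made to work by intersecting the trigram sets rather than the raw sentence strings, and by looping over the pairs only once. This is a minimal sketch on three of the sentences, assuming plain whitespace tokenization; `word_trigrams` and `no_overlap` are illustrative names, not part of nltk or the code above.

```python
from itertools import combinations


def word_trigrams(sentence):
    """Return the set of word trigrams of a sentence (lowercased, split on whitespace)."""
    words = sentence.lower().split()
    return set(zip(words, words[1:], words[2:]))


sentence_tokens = [
    "I would like to eat pizza with her.",
    "She would like to eat pizza with olive.",
    "The quick brown fox jumps over the lazy dog.",
]

# Precompute one trigram set per sentence.
trigram_sets = {s: word_trigrams(s) for s in sentence_tokens}

common = []      # pairs sharing at least one trigram
no_overlap = []  # pairs sharing none
for s1, s2 in combinations(sentence_tokens, 2):
    if trigram_sets[s1] & trigram_sets[s2]:
        common.append((s1, s2))
    else:
        no_overlap.append((s1, s2))

print(common)
```

This only yields overlapping pairs. Turning pairs into clusters still needs a merging step that joins pairs with a sentence in common.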
【Discussion】: