How to cluster sentences by using n-gram overlap in Python?

【Question title】: How to cluster sentences by using n-gram overlap in Python?
【Posted】: 2019-05-22 10:25:51
【Question】:

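I need to cluster sentences according to the n-grams they have in common. I can extract the n-grams easily with nltk, but I have no idea how to perform clustering based on n-gram overlap. That is why I cannot show much real code here, and I apologize for that up front. I wrote 6 simple sentences and the expected output to illustrate the problem.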

import nltk
from nltk.util import ngrams

sentences = """I would like to eat pizza with her.
She would like to eat pizza with olive.
There are some sentences must be clustered.
These sentences must be clustered according to common trigrams.
The quick brown fox jumps over the lazy dog.
Apples are red, bananas are yellow."""

# Load the English Punkt sentence tokenizer (requires the NLTK 'punkt' data).
sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
sentence_tokens = sent_detector.tokenize(sentences.strip())

# Build the list of word trigrams for each sentence.
mytrigrams = []
for sentence in sentence_tokens:
    trigrams = ngrams(sentence.lower().split(), 3)
    mytrigrams.append(list(trigrams))

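After this point I do not know how to cluster the sentences according to their common trigrams (I am not even sure the part above is OK). I tried itertools combinations, but I got lost: I do not know how to generate multiple clusters, because the number of clusters cannot be known without comparing every sentence with every other one. The expected output is below. Thanks in advance for your help.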

Cluster1: 'I would like to eat pizza with her.'
          'She would like to eat pizza with olive.'

Cluster2: 'There are some sentences must be clustered.' 
          'These sentences must be clustered according to common trigrams.'

Sentences do not belong to any cluster:                                
          'The quick brown fox jumps over the lazy dog.'
          'Apples are red, bananas are yellow.'

Edit: I tried combinations once more, but it did not cluster anything; it just returned all sentence pairs. (Obviously I did something wrong.)

from itertools import combinations

new_dict = {k: v for k, v in zip(sentence_tokens, mytrigrams)}

common=[] 
no_cluster=[]   
sentence_pairs=combinations(new_dict.keys(), 2)

for keys, values in new_dict.items():

    for values in sentence_pairs:
        sentence1= values[0]
        sentence2= values[1]
        #print(sentence1, sentence2)
        if len(set(sentence1) & set(sentence2))!=0:
            common.append((sentence1, sentence2))
        else:
            no_cluster.append((sentence1, sentence2))


print(common)

But even if this code worked, it would not give the output I expect, because I do not know how to generate multiple clusters based on common n-grams.
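
Note: the combinations attempt above compares `set(sentence1) & set(sentence2)`, which are sets of characters of the two strings rather than sets of trigrams, so every pair overlaps and every pair lands in `common`. One possible way to get from pairwise overlap to multiple clusters is to treat sentences as nodes, link two sentences when their trigram sets intersect, and take the connected components. The sketch below is only an illustration under stated assumptions: it uses plain whitespace tokenization instead of nltk, the helper names `ngram_set` and `cluster_by_overlap` are made up for this example, and grouping is transitive (A and C end up together if both overlap with B).

```python
from itertools import combinations

sentences = [
    "I would like to eat pizza with her.",
    "She would like to eat pizza with olive.",
    "There are some sentences must be clustered.",
    "These sentences must be clustered according to common trigrams.",
    "The quick brown fox jumps over the lazy dog.",
    "Apples are red, bananas are yellow.",
]


def ngram_set(sentence, n=3):
    """Return the set of word n-grams of a lowercased, whitespace-split sentence."""
    words = sentence.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def cluster_by_overlap(sentences, n=3):
    """Group sentences that share at least one n-gram (connected components)."""
    grams = [ngram_set(s, n) for s in sentences]
    parent = list(range(len(sentences)))  # union-find forest, one node per sentence

    def find(i):
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Compare every pair once; merge the two groups when n-gram sets intersect.
    for i, j in combinations(range(len(sentences)), 2):
        if grams[i] & grams[j]:
            parent[find(j)] = find(i)

    # Collect sentences by their root, preserving input order.
    groups = {}
    for i, s in enumerate(sentences):
        groups.setdefault(find(i), []).append(s)

    clusters = [g for g in groups.values() if len(g) > 1]
    unclustered = [g[0] for g in groups.values() if len(g) == 1]
    return clusters, unclustered


clusters, unclustered = cluster_by_overlap(sentences)
for number, group in enumerate(clusters, start=1):
    print(f"Cluster{number}: {group}")
print(f"Sentences that do not belong to any cluster: {unclustered}")
```

Because punctuation stays attached to the words here ("clustered." versus "clustered"), some trigrams that look identical will not match; stripping punctuation before splitting would be a reasonable refinement.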

【Question comments】:

Tags: python nlp n-gram

【Answer 1】:

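To understand your problem better, could you explain the purpose and the expected result?
You have to be very careful with n-grams: using them increases the dimensionality of your dataset.
I suggest you use TF-IDF first, and only fall back to n-grams if that does not reach the minimum hit rate you need.
If you can explain your problem in more detail, I can see whether I am able to help.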
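
To make the TF-IDF suggestion concrete, here is a minimal pure-Python sketch that scores sentence similarity with TF-IDF weights and cosine similarity. It is an illustration only: it assumes whitespace tokenization and the plain tf * log(N / df) weighting, which is just one of several common variants, and the helper names `tfidf_vectors` and `cosine` are made up for this example rather than taken from any library.

```python
import math
from collections import Counter

sentences = [
    "I would like to eat pizza with her.",
    "She would like to eat pizza with olive.",
    "There are some sentences must be clustered.",
    "These sentences must be clustered according to common trigrams.",
    "The quick brown fox jumps over the lazy dog.",
    "Apples are red, bananas are yellow.",
]


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document."""
    tokenized = [d.lower().split() for d in docs]
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(docs)
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


vectors = tfidf_vectors(sentences)
sim_pizza = cosine(vectors[0], vectors[1])      # the two "pizza" sentences
sim_unrelated = cosine(vectors[0], vectors[4])  # "pizza" vs. "quick brown fox"
print(sim_pizza, sim_unrelated)
```

Unlike a trigram-overlap test, this yields a graded score per pair, so a similarity threshold (or a clustering algorithm on top of the scores) would still be needed to form groups.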

【Comments】:

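Thanks for your comment, but you should have posted it as a comment; it is not an answer. I have already tried the TF-IDF approach, and it did not give me what I want. If there is a way to do this, I would like to learn it. I think I described the problem quite well (except for my own code, since I could not really write it).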