Generate bigrams with NLTK: answers

【Question title】: Generate bigrams with NLTK
【Posted】: 2016-10-05 16:26:39
【Question】:

I am trying to generate a list of bigrams for a given sentence. For example, if I input

    To be or not to be

I want the program to produce

     to be, be or, or not, not to, to be

I tried the following code, but it only gives me

    <generator object bigrams at 0x0000000009231360>

Here is my code:

    import nltk
    bigrm = nltk.bigrams(text)
    print(bigrm)

So how can I get what I want? I want a list of word pairs like the ones above (to be, be or, or not, not to, to be).

【Comments】:

Try: list(bigrm)
Just because I like code: Here is a nice NLTK-independent bigram one-liner.
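
The one-liner that comment links to is not shown in this thread. As a minimal sketch of a common NLTK-free idiom (an assumption, it may differ from the linked code), each token can be paired with its successor using zip. The names `tokens` and `bigrams` are illustrative only:

```python
# NLTK-free bigrams: pair each token with the one that follows it.
text = "to be or not to be"
tokens = text.split()

# zip stops at the shorter input, so the last token starts no pair.
bigrams = list(zip(tokens, tokens[1:]))

print(', '.join(' '.join(pair) for pair in bigrams))
# prints: to be, be or, or not, not to, to be
```

This only covers plain whitespace splitting; punctuation stays attached to the words.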

Tags: python nltk n-gram

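【Solution 1】:

nltk.bigrams() returns an iterator over the bigrams (specifically, a generator). If you want a list, pass the iterator to list(). It also expects a sequence of items from which to build the bigrams, so you have to split the text before passing it in (if you have not already done so):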

    bigrm = list(nltk.bigrams(text.split()))

To print them out separated by commas, you can do this (in Python 3):

    print(*map(' '.join, bigrm), sep=', ')

If you are on Python 2, then for example:

    print ', '.join(' '.join((a, b)) for a, b in bigrm)

Note that if you only need to print the bigrams, you do not have to build a list; you can use the iterator directly.
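
Two points from this answer can be checked with a short sketch, assuming Python 3 and an installed NLTK: a raw string passed to nltk.bigrams() is treated as a sequence of characters, and the returned iterator can only be consumed once. The variable names `char_pairs`, `first_pass`, and `second_pass` are illustrative only:

```python
import nltk

text = "to be or not to be"

# A raw string is a sequence of characters, so this yields character pairs.
char_pairs = list(nltk.bigrams("to be"))
# [('t', 'o'), ('o', ' '), (' ', 'b'), ('b', 'e')]

# Splitting first yields word bigrams; the result is a one-shot iterator.
bigrm = nltk.bigrams(text.split())
first_pass = [' '.join(pair) for pair in bigrm]
second_pass = [' '.join(pair) for pair in bigrm]  # already consumed

print(', '.join(first_pass))  # to be, be or, or not, not to, to be
print(second_pass)            # []
```

So the iterator is enough for a single print, but wrap it in list() if the bigrams are needed more than once.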

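【Discussion】:

【Solution 2】:

The following code generates the list of bigrams for a given sentence. Note that word_tokenize relies on NLTK's Punkt tokenizer data, which must be downloaded beforehand.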

>>> import nltk
>>> from nltk.tokenize import word_tokenize
>>> text = "to be or not to be"
>>> tokens = word_tokenize(text)
>>> bigrm = nltk.bigrams(tokens)
>>> print(*map(' '.join, bigrm), sep=', ')
to be, be or, or not, not to, to be

【Discussion】:

【Solution 3】:

A late answer, but here is another way.

>>> from nltk.util import ngrams
>>> text = "I am batman and I like coffee"
>>> _1gram = text.split(" ")
>>> _2gram = [' '.join(e) for e in ngrams(_1gram, 2)]
>>> _3gram = [' '.join(e) for e in ngrams(_1gram, 3)]
>>> 
>>> _1gram
['I', 'am', 'batman', 'and', 'I', 'like', 'coffee']
>>> _2gram
['I am', 'am batman', 'batman and', 'and I', 'I like', 'like coffee']
>>> _3gram
['I am batman', 'am batman and', 'batman and I', 'and I like', 'I like coffee']

【Discussion】: