NLTK package to estimate the (unigram) perplexity: answers

【Question title】: NLTK package to estimate the (unigram) perplexity
【Posted】: 2016-01-20 21:38:03
【Question】:

I am trying to compute the perplexity of the data I have. The code I am using is:

import sys
sys.path.append("/usr/local/anaconda/lib/python2.7/site-packages/nltk")

from nltk.corpus import brown
from nltk.model import NgramModel
from nltk.probability import LidstoneProbDist, WittenBellProbDist
estimator = lambda fdist, bins: LidstoneProbDist(fdist, 0.2)
lm = NgramModel(3, brown.words(categories='news'), True, False, estimator)
print lm

But I get this error:

File "/usr/local/anaconda/lib/python2.7/site-packages/nltk/model/ngram.py", line 107, in __init__
cfd[context][token] += 1
TypeError: 'int' object has no attribute '__getitem__'

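I have already run Latent Dirichlet Allocation on my data and generated the unigrams with their respective probabilities (they are normalized so that the probabilities over the data sum to 1).

My unigrams and their probabilities look like this: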

Negroponte 1.22948976891e-05
Andreas 7.11290670484e-07
Rheinberg 7.08255885794e-07
Joji 4.48481435106e-07
Helguson 1.89936727391e-07
CAPTION_spot 2.37395965468e-06
Mortimer 1.48540253778e-07
yellow 1.26582575863e-05
Sugar 1.49563800878e-06
four 0.000207196011781

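This is just a fragment of the unigrams file I have. About 1000 lines follow the same format. The probabilities (second column) sum to 1.

I am a budding programmer. This ngram.py belongs to the nltk package, and I am confused about how to fix this. The sample code I have here comes from the NLTK documentation, and I don't know what to do now. Please help me with what I can do. Thanks in advance!

【Comments】:

You first said you wanted to compute the perplexity of a unigram model on a text corpus. But now you have removed the word unigram.
The NLTK example code itself doesn't work :( The example uses a trigram; if it worked, I would change it to a unigram. How do I get past this error?
Do you have to use NLTK?
I'm not tied to NLTK. I just thought it would be easier to use as a programming novice. Is there another method or package I can use to estimate the perplexity of my own data (not the Brown corpus)?
Sure there is. I'll assume you have a plain text file from which you want to build a unigram language model and then compute that model's perplexity. Is that right?

Tags: python-2.7 nlp nltk n-gram language-model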

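【Solution 1】:

Perplexity is the inverse probability of the test set, normalized by the number of words. For unigrams:

$$PP(W) = \left( \prod_{i=1}^{N} \frac{1}{P(w_i)} \right)^{1/N}$$

You say you have already built the unigram model, meaning that for every word you have the associated probability. Then you only need to apply the formula. I assume you have a big dictionary unigram[word] that gives the probability of each word in the corpus. You also need a test set. If your unigram model is not in the form of a dictionary, tell me what data structure you used so that I can adapt my solution accordingly.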

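perplexity = 1
N = 0

for word in testset:
    if word in unigram:  # words missing from the model are skipped
        N += 1
        perplexity = perplexity * (1/unigram[word])
# N-th root of the product of inverse probabilities (requires N > 0)
perplexity = pow(perplexity, 1/float(N))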
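
If the probabilities live in a text file like the one in the question (one `word probability` pair per line), the `unigram` dictionary assumed above could be built roughly as follows. This is only a sketch: the function name `load_unigrams` is made up for illustration, and the sample lines are copied from the question.

```python
def load_unigrams(lines):
    """Parse 'word probability' lines into a dict mapping word -> float."""
    unigram = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        word, prob = parts
        unigram[word] = float(prob)
    return unigram

# A few lines in the format shown in the question.
sample = [
    "Negroponte 1.22948976891e-05",
    "yellow 1.26582575863e-05",
    "four 0.000207196011781",
]
unigram = load_unigrams(sample)
```

For a real file, an open file object can be passed in place of the list, since iterating over it also yields one line at a time.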

Update:

Since you asked for a complete working example, here is a very simple one.

Suppose this is our corpus:

corpus = """
Monty Python (sometimes known as The Pythons) were a British surreal comedy group who created the sketch comedy show Monty Python's Flying Circus,
that first aired on the BBC on October 5, 1969. Forty-five episodes were made over four series. The Python phenomenon developed from the television series
into something larger in scope and impact, spawning touring stage shows, films, numerous albums, several books, and a stage musical.
The group's influence on comedy has been compared to The Beatles' influence on music."""

This is how we first build the unigram model:

import collections, nltk
# we first tokenize the text corpus
tokens = nltk.word_tokenize(corpus)

# here you construct the unigram language model
def unigram(tokens):
    # every entry starts at 0.01, so unseen words later look up as 0.01
    model = collections.defaultdict(lambda: 0.01)
    for f in tokens:
        model[f] += 1  # a defaultdict never raises KeyError here
    # compute the total once, before normalizing
    N = float(sum(model.values()))
    for word in model:
        model[word] = model[word]/N
    return model

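Our model here is smoothed, in a crude way: for words outside its vocabulary it returns a fixed low probability of 0.01 (so the values no longer sum to exactly 1 once unknown words are looked up). I already told you how to compute the perplexity: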

# computes perplexity of the unigram model on a testset
def perplexity(testset, model):
    testset = testset.split()
    perplexity = 1
    N = 0
    for word in testset:
        N += 1
        perplexity = perplexity * (1/model[word])
    perplexity = pow(perplexity, 1/float(N))
    return perplexity

Now we can try it on two different test sets:

testset1 = "Monty"
testset2 = "abracadabra gobbledygook rubbish"

model = unigram(tokens)
print(perplexity(testset1, model))
print(perplexity(testset2, model))

You will get the following results:

>>> 
49.09452736318415
99.99999999999997

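Note that when working with perplexity, we try to reduce it. A language model with lower perplexity on a given test set is preferable to one with higher perplexity. In the first test set, the word Monty is included in the unigram model, so the corresponding perplexity is also smaller.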
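
For longer test sets the running product in `perplexity` can overflow a float, because every factor `1/model[word]` is at least 1 and usually much larger. An equivalent formulation, sketched here under the assumption that `model[word]` returns a positive probability for every word, sums log probabilities and exponentiates at the end, i.e. exp(-log-likelihood / N). The name `log_perplexity` and the toy probabilities are made up for illustration.

```python
import collections
import math

def log_perplexity(testset, model):
    """Unigram perplexity computed as exp(-(sum of log p(w)) / N)."""
    words = testset.split()
    if not words:
        raise ValueError("test set is empty")
    log_likelihood = sum(math.log(model[w]) for w in words)
    return math.exp(-log_likelihood / len(words))

# Toy model (assumed values): unseen words fall back to 0.01.
toy = collections.defaultdict(lambda: 0.01, {"Monty": 0.02, "Python": 0.03})
print(log_perplexity("Monty Python", toy))
print(log_perplexity("abracadabra gobbledygook rubbish", toy))
```

Up to floating-point rounding this gives the same number as the product form, since the N-th root of a product of inverse probabilities equals the exponential of the negative mean log probability.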

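【Comments】:

Could you provide a sample input for the code above and give its output? It would make it easier for me to shape my data accordingly. I have edited the question to add the unigrams and their probabilities from my input file, for which the perplexity should be computed.
Hey! But shouldn't I also include the log-likelihood, e.g. perplexity(test set) = exp{-(log-likelihood / count of tokens)}? qpleple.com/perplexity-to-evaluate-topic-models
Thank you very much for your time and the code. I'll give it a try. I have to compute the perplexity of the unigrams generated by an LDA model. I think I can use this code on my data and check. Thanks a lot!
Isn't there an error in the model construction at the line model[word] = model[word]/float(len(model))? Shouldn't it be model[word] = model[word]/float(sum(model.values()))?
In the line model[word]/float(sum(model.values())), sum(model.values()) is recomputed after every normalized value is written back. As a result, the normalized values sum to 3.4 rather than 1. The sum has to be computed once and then used inside the for loop. @heiner is indeed right, and I can't see where the answer addresses it.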

【Solution 2】:

Thanks for the code snippet! Shouldn't:

for word in model:
    model[word] = model[word]/float(sum(model.values()))

rather be:

v = float(sum(model.values()))
for word in model:
    model[word] = model[word]/v

Oh... I see this has already been answered...
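
To see why the sum has to be taken once, before the loop, here is a small self-contained comparison on made-up counts (the dictionary `counts` is purely illustrative). Recomputing `sum(...)` inside the loop mixes already-normalized entries with raw counts, so the result no longer sums to 1.

```python
counts = {"a": 2.0, "b": 1.0, "c": 1.0}

# Buggy: the denominator changes as entries are overwritten.
buggy = dict(counts)
for word in buggy:
    buggy[word] = buggy[word] / float(sum(buggy.values()))

# Fixed: compute the total once, then divide every entry by it.
fixed = dict(counts)
total = float(sum(fixed.values()))
for word in fixed:
    fixed[word] = fixed[word] / total

print(sum(buggy.values()))  # noticeably different from 1
print(sum(fixed.values()))  # 1, up to floating-point rounding
```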

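【Comments】:

Hi Heiner, welcome to SO. As you have noticed, this question was already well answered some years ago. There is nothing wrong with adding more answers to an already answered question, but you may want to make sure they add enough value to warrant it. In this case, you might want to consider focusing on answering these new questions instead!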