nltk language model (ngram): calculate the probability of a word from context

【Question title】: nltk language model (ngram) calculate the prob of a word from context
【Posted】: 2011-06-24 02:28:48
【Question】:

I am building a language model with Python and NLTK as follows:

from nltk.corpus import brown
from nltk.model import NgramModel  # NLTK 2.x; per the comments below, gone in NLTK 3.3
from nltk.probability import LidstoneProbDist, WittenBellProbDist
estimator = lambda fdist, bins: LidstoneProbDist(fdist, 0.2)
lm = NgramModel(3, brown.words(categories='news'), estimator)
# Thanks to miku, I fixed this problem
print lm.prob("word", ["This is a context which generates a word"])
>> 0.00493261081006
# But I got another problem like this one...
print lm.prob("b", ["This is a context which generates a word"])

But it doesn't seem to work. The result is as follows:

>>> print lm.prob("word", "This is a context which generates a word")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.6/dist-packages/nltk/model/ngram.py", line 79, in prob
    return self._alpha(context) * self._backoff.prob(word, context[1:])
  File "/usr/local/lib/python2.6/dist-packages/nltk/model/ngram.py", line 79, in prob
    return self._alpha(context) * self._backoff.prob(word, context[1:])
  File "/usr/local/lib/python2.6/dist-packages/nltk/model/ngram.py", line 82, in prob
    "context %s" % (word, ' '.join(context)))
TypeError: not all arguments converted during string formatting

Can anyone help me? Thanks!

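【Question comments】:

I'm not sure how to import the Ngram model in nltk. Can you help me?
Same problem here; I think it comes from an older version.
@RikenShah from nltk import ngrams works for me, but the estimators and other parts seem to be different as well.

Tags: python nlp nltk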

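【Answer 1】:

I know this question is old, but it comes up every time I google nltk's NgramModel class. NgramModel's prob implementation is a little unintuitive, and the asker is confused. As far as I can tell, the existing answers are not great. I don't use NgramModel often, which means I get confused too. No more.

The source code lives here: https://github.com/nltk/nltk/blob/master/nltk/model/ngram.py. Here is the definition of NgramModel's prob method: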

def prob(self, word, context):
    """
    Evaluate the probability of this word in this context using Katz Backoff.

    :param word: the word to get the probability of
    :type word: str
    :param context: the context the word is in
    :type context: list(str)
    """

    context = tuple(context)
    if (context + (word,) in self._ngrams) or (self._n == 1):
        return self[context].prob(word)
    else:
        return self._alpha(context) * self._backoff.prob(word, context[1:])

(Note: 'self[context].prob(word)' is equivalent to 'self._model[context].prob(word)'.)

OK. Now at least we know what to look for. What does context need to be? Let's look at an excerpt from the constructor:

for sent in train:
    for ngram in ingrams(chain(self._lpad, sent, self._rpad), n):
        self._ngrams.add(ngram)
        context = tuple(ngram[:-1])
        token = ngram[-1]
        cfd[context].inc(token)

if not estimator_args and not estimator_kwargs:
    self._model = ConditionalProbDist(cfd, estimator, len(cfd))
else:
    self._model = ConditionalProbDist(cfd, estimator, *estimator_args, **estimator_kwargs)

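OK. The constructor builds a conditional probability distribution (self._model) from a conditional frequency distribution whose "contexts" are tuples of unigrams. This tells us that 'context' should not be a string, nor a list containing a single multi-word string. 'context' must be an iterable of unigrams. In fact, the requirement is a little stricter: these tuples or lists must have size n-1. Think of it this way: you told it to be a trigram model, so you had better give it a context that is appropriate for trigrams.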
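To make the shape of those contexts concrete, here is a minimal pure-Python sketch of the counting loop from the constructor excerpt. It does not use NLTK: `count_ngrams` is a hypothetical helper written for this illustration, and padding is left out for brevity.

```python
from collections import Counter, defaultdict

def count_ngrams(tokens, n):
    """Map each (n-1)-token context tuple to a Counter of the tokens that follow it."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        ngram = tuple(tokens[i:i + n])
        context, token = ngram[:-1], ngram[-1]
        counts[context][token] += 1
    return counts

tokens = 'the rain in spain falls mainly in the plains'.split()
trigram_counts = count_ngrams(tokens, 3)

# Every key is a tuple of exactly n-1 = 2 separate words.
for context, followers in sorted(trigram_counts.items()):
    print(context, dict(followers))
```

A lookup with a key such as ('rain in',) or 'rain in' can never match, because no key of that shape was ever stored.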

Let's look at this with a simpler example:

>>> import nltk
>>> obs = 'the rain in spain falls mainly in the plains'.split()
>>> lm = nltk.NgramModel(2, obs, estimator=nltk.MLEProbDist)
>>> lm.prob('rain', 'the') #wrong
0.0
>>> lm.prob('rain', ['the']) #right
0.5
>>> lm.prob('spain', 'rain in') #wrong
0.0
>>> lm.prob('spain', ['rain in']) #wrong
'''long exception'''
>>> lm.prob('spain', ['rain', 'in']) #right
1.0
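The results marked as wrong can be traced back to the `context = tuple(context)` line in prob. This small stand-alone check uses only built-in Python (no NLTK) to show what each of the contexts above becomes after that conversion:

```python
# The same five contexts that were passed to prob() above.
examples = ['the', ['the'], 'rain in', ['rain in'], ['rain', 'in']]

# prob() starts by calling tuple() on whatever it receives.
converted = [tuple(c) for c in examples]

for raw, ctx in zip(examples, converted):
    print(repr(raw), '->', ctx)
```

A bare string is split into single characters, and a list holding one multi-word string yields a single "word" that contains a space. Only a list of separate tokens produces a tuple of unigrams.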

(As a side note, actually trying to do anything with MLE as the estimator in NgramModel is a bad idea. Things will fall apart. I guarantee it.)
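One plausible reason MLE is fragile here: any word never observed after a context gets probability exactly zero. The sketch below compares MLE with additive (Lidstone) smoothing, (count + gamma) / (total + gamma * bins), on hand-made counts. It is plain Python 3, not NLTK code; the helper names and the `bins` value are assumptions for this illustration, and gamma = 0.2 mirrors the value in the question.

```python
from collections import Counter

def mle_prob(counts, word):
    """Unsmoothed relative frequency."""
    total = sum(counts.values())
    return counts[word] / total

def lidstone_prob(counts, word, gamma, bins):
    """Additive smoothing: every one of the `bins` outcomes gets gamma extra counts."""
    total = sum(counts.values())
    return (counts[word] + gamma) / (total + gamma * bins)

# Words seen after 'the' in the toy sentence above.
counts = Counter({'rain': 1, 'plains': 1})
bins = 7  # assumed vocabulary size: the 7 distinct words of the toy sentence

print(mle_prob(counts, 'spain'))                  # unseen word: exactly zero
print(lidstone_prob(counts, 'spain', 0.2, bins))  # small but non-zero
print(lidstone_prob(counts, 'rain', 0.2, bins))   # seen word: discounted a little
```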

As for the original question, my best guess at what the OP wants is:

print lm.prob("word", "generates a".split())
print lm.prob("b", "generates a".split())

...but there are so many misunderstandings going on here that I can't tell what he was actually trying to do.
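If the goal is to score a word given a whole sentence, the sentence first has to be cut down to its last n-1 tokens. A minimal sketch, assuming plain whitespace tokenization; `last_context` is a hypothetical helper and not part of NLTK:

```python
def last_context(text, n):
    """Return the last n-1 whitespace-separated tokens of text as a list."""
    if n <= 1:
        return []  # a unigram model uses no context at all
    tokens = text.split()
    return tokens[-(n - 1):]

sentence = "This is a context which generates a"
print(last_context(sentence, 3))  # ['generates', 'a']
```

The returned list has the shape that the second argument of lm.prob expects for a trigram model.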

【Comments】:

【Answer 2】:

Quick fix:

print lm.prob("word", ["This is a context which generates a word"])
# => 0.00493261081006

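【Comments】:

But I ran into another problem... Why do print lm.prob("word", ["word"]), print lm.prob("word", ["word word word"]), and print lm.prob("word", ["this"]) all produce exactly the same probability? They are all 0.00493261081006...
@Austin, sorry, I'm short on time, so I can't elaborate right now. Maybe later.

【Answer 3】:

Regarding your second question: this happens because "b" does not occur in the Brown corpus category news, which you can verify with: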

>>> 'b' in brown.words(categories='news')
False

whereas

>>> 'word' in brown.words(categories='news')
True

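I admit the error message is very cryptic, so you may want to file a bug report with the NLTK authors.

【Comments】:

Thanks! I agree the error should not surface this way, so I will file a bug report with NLTK. Thanks anyway.

【Answer 4】:

I would stay away from NLTK's NgramModel for the time being. There is currently a smoothing bug that causes the model to greatly overestimate likelihoods when n>1. If you do end up using NgramModel, you should definitely apply the fix mentioned in the git issue tracker here: https://github.com/nltk/nltk/issues/367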

【Comments】:

Which Python library would you recommend instead? Especially keeping in mind that the current version of NLTK (3.3) no longer has NgramModel.