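NLTK perplexity measure inversion: answers

【Question title】: NLTK Perplexity measure inversion
【Posted】: 2019-07-26 03:41:55
【Question】:

I am given a training text and a test text. I want to train a language model on the training data and use it to compute the perplexity of the test data.

Here is my code: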

import os
import requests
import io #codecs
from nltk.util import everygrams
from nltk.lm.preprocessing import pad_both_ends

from nltk import word_tokenize, sent_tokenize 

fileTest = open("AaronPressman.txt","r");
with io.open('AaronPressman.txt', encoding='utf8') as fin:
        textTest = fin.read()
if os.path.isfile('AaronPressmanEdited.txt'):
    with io.open('AaronPressmanEdited.txt', encoding='utf8') as fin:
        text = fin.read()

# Tokenize the text.
tokenized_text = [list(map(str.lower, word_tokenize(sent))) 
                for sent in sent_tokenize(text)]

from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.lm import Laplace

n = 1
padded_bigrams = list(pad_both_ends(word_tokenize(textTest), n=1))
trainTest = everygrams(padded_bigrams, min_len = n , max_len=n);

train_data, padded_sents = padded_everygram_pipeline(n, tokenized_text)


model = Laplace(n) 
model.fit(train_data, padded_sents)

print(model.perplexity(trainTest))

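When I run this code with n=1 (unigrams), I get 1068.332393940235. With n=2 (bigrams) I get 1644.3441077259993, and with trigrams I get 2552.2085752565313.

What is wrong with it?

【Comments on the question】:

Tags: python machine-learning nltk

【Answer 1】:

The way you build the test data is wrong: the training data is lowercased but the test data is not, and the test data is missing the start and end padding tokens. Try this: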

import os
import requests
import io #codecs
from nltk.util import everygrams
from nltk.lm.preprocessing import pad_both_ends
from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.lm import Laplace
from nltk import word_tokenize, sent_tokenize 

"""
fileTest = open("AaronPressman.txt","r");
with io.open('AaronPressman.txt', encoding='utf8') as fin:
        textTest = fin.read()
if os.path.isfile('AaronPressmanEdited.txt'):
    with io.open('AaronPressmanEdited.txt', encoding='utf8') as fin:
        text = fin.read()
"""
textTest = "This is an ant. This is a cat"
text = "This is an orange. This is a mango"

n = 2
# Tokenize and lowercase the training text, then build padded everygrams
# and the vocabulary stream from it.
train_tokens = [list(map(str.lower, word_tokenize(sent)))
                for sent in sent_tokenize(text)]
train_data, train_vocab = padded_everygram_pipeline(n, train_tokens)

# Preprocess the test text the same way; its vocabulary stream is not needed.
test_tokens = [list(map(str.lower, word_tokenize(sent)))
               for sent in sent_tokenize(textTest)]
test_data, _ = padded_everygram_pipeline(n, test_tokens)

model = Laplace(1)
# Fit on the training n-grams and the training vocabulary.
model.fit(train_data, train_vocab)

# Sum the perplexity of each test sentence, then report the average.
s = 0
for i, test in enumerate(test_data):
    p = model.perplexity(test)
    s += p

print("Perplexity: {0}".format(s/(i+1)))
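
The same idea can be sketched without the tokenizer data, as a variation rather than the answer's own method. The sketch below makes several assumptions: whitespace splitting stands in for word_tokenize (so no punkt download is needed), the helper names preprocess and sentence_perplexity are hypothetical, only n-grams of order n are scored at test time, and Laplace(n) is taken to build a model of order n. The point it illustrates is that training and test text go through identical lowercasing and padding, and that the model is fitted on the training vocabulary only.

```python
from nltk.lm import Laplace
from nltk.lm.preprocessing import pad_both_ends, padded_everygram_pipeline
from nltk.util import ngrams


def preprocess(sentences):
    """Lowercase and whitespace-split each sentence (stand-in for word_tokenize)."""
    return [s.lower().split() for s in sentences]


def sentence_perplexity(model, tokens, n):
    """Pad one tokenized sentence and score its order-n n-grams."""
    padded = list(pad_both_ends(tokens, n=n))
    return model.perplexity(ngrams(padded, n))


n = 2
train_sents = preprocess(["This is an orange .", "This is a mango ."])
test_sents = preprocess(["This is an ant .", "This is a cat ."])

# Fit on the training n-grams and the training vocabulary only.
train_ngrams, train_vocab = padded_everygram_pipeline(n, train_sents)
model = Laplace(n)
model.fit(train_ngrams, train_vocab)

# Words unseen in training ("ant", "cat") are mapped to the unknown token.
scores = [sentence_perplexity(model, sent, n) for sent in test_sents]
mean_perplexity = sum(scores) / len(scores)
print("Mean per-sentence perplexity:", mean_perplexity)
```

Averaging per-sentence perplexities, as here and in the answer, is not the same as the perplexity of the whole test set taken as one sequence of n-grams, so the two numbers should not be compared directly.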

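【Comments】:

Thanks, but I think you missed one small thing: you should pass n to the Laplace function, right?
Can you explain why you summed all the perplexities? Is that correct?
The value passed to Laplace is the smoothing parameter, usually > 0. It has nothing to do with n (unigram, bigram, or n-gram). We compute the perplexity of each test sentence (because the perplexity method accepts a single generator, not a list of generators) and finally average them (as you can see in the print statement). So perplexity = s/(i+1).
OK, so if I want to use MLE, for example, I have to pass n, right?
For MLE you must pass the highest n-gram order (i.e. n in your case).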
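
One way to settle what each constructor argument means is to inspect the model objects directly. This is a minimal sketch, assuming the order and gamma attributes exposed by nltk.lm in recent NLTK releases, so it is worth running against the installed version. In those releases Laplace takes the model order as its first positional argument and fixes gamma at 1, while Lidstone takes gamma first and the order second.

```python
from nltk.lm import MLE, Laplace, Lidstone

n = 2
mle = MLE(n)                 # unsmoothed model of order n
laplace = Laplace(n)         # add-one smoothing: order n, gamma fixed at 1
lidstone = Lidstone(0.5, n)  # additive smoothing: gamma=0.5, then order n

for name, lm in [("MLE", mle), ("Laplace", laplace), ("Lidstone", lidstone)]:
    # MLE has no gamma attribute, so fall back to None for it.
    print(name, "order =", lm.order, "gamma =", getattr(lm, "gamma", None))
```

Under that reading, Laplace(1) builds a unigram-order model, and a tunable smoothing constant requires Lidstone.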