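【Posted】: 2019-07-26 03:41:55
【Question】:
I have been given a training text and a test text. I want to train a language model on the training data and use it to compute the perplexity of the test data.
Here is my code: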
import os
import requests
import io #codecs
from nltk.util import everygrams
from nltk.lm.preprocessing import pad_both_ends
from nltk import word_tokenize, sent_tokenize
fileTest = open("AaronPressman.txt","r");
with io.open('AaronPressman.txt', encoding='utf8') as fin:
    textTest = fin.read()
if os.path.isfile('AaronPressmanEdited.txt'):
    with io.open('AaronPressmanEdited.txt', encoding='utf8') as fin:
        text = fin.read()
# Tokenize the text.
tokenized_text = [list(map(str.lower, word_tokenize(sent)))
                  for sent in sent_tokenize(text)]
from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.lm import Laplace
n = 1
padded_bigrams = list(pad_both_ends(word_tokenize(textTest), n=1))
trainTest = everygrams(padded_bigrams, min_len = n , max_len=n);
train_data, padded_sents = padded_everygram_pipeline(n, tokenized_text)
model = Laplace(n)
model.fit(train_data, padded_sents)
print(model.perplexity(trainTest))
When I run this code with n=1 (unigrams), I get 1068.332393940235. With n=2 (bigrams) I get 1644.3441077259993, and with trigrams I get 2552.2085752565313.
What is wrong with it?
【Discussion】:
Tags: python machine-learning nltk
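One detail visible in the code is that the training text is lowercased before fitting, while the test text is tokenized as-is. The block below is a minimal pure-Python sketch, not NLTK's implementation, of how an add-one (Laplace) smoothed unigram perplexity reacts to that kind of mismatch. The function name `laplace_unigram_perplexity`, the toy sentences, and the choice to reserve a single vocabulary slot for unseen words are all assumptions made for illustration.

```python
import math
from collections import Counter


def laplace_unigram_perplexity(train_tokens, test_tokens):
    """Perplexity of test_tokens under an add-one smoothed unigram model."""
    counts = Counter(train_tokens)
    total = sum(counts.values())
    # Assumption: vocabulary = seen word types plus one slot for unseen words.
    vocab_size = len(counts) + 1
    log_prob_sum = 0.0
    for token in test_tokens:
        # Counter returns 0 for unseen tokens, so they still get a non-zero probability.
        prob = (counts[token] + 1) / (total + vocab_size)
        log_prob_sum += math.log2(prob)
    # Perplexity is 2 raised to the average negative log2 probability.
    return 2 ** (-log_prob_sum / len(test_tokens))


# Toy data (hypothetical, not the AaronPressman files).
train = "the cat sat on the mat . the dog sat on the rug .".lower().split()
test_raw = "The cat sat on the rug .".split()         # case left untouched
test_lower = [tok.lower() for tok in test_raw]        # same preprocessing as train

ppl_raw = laplace_unigram_perplexity(train, test_raw)
ppl_lower = laplace_unigram_perplexity(train, test_lower)
print(ppl_raw, ppl_lower)
```

In this toy setup the capitalized "The" is treated as an unseen word, so the raw test sequence scores a higher perplexity than the lowercased one. Whether the same effect accounts for the numbers reported here cannot be determined from the post alone.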