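【Question title】: NLTK perplexity measure inversion
【Posted】: 2019-07-26 03:41:55
【Question】:

I am given a training text and a test text. I want to train a language model on the training data and use it to compute the perplexity of the test data.

Here is my code: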

import os
import requests
import io #codecs
from nltk.util import everygrams
from nltk.lm.preprocessing import pad_both_ends

from nltk import word_tokenize, sent_tokenize 

fileTest = open("AaronPressman.txt","r");
with io.open('AaronPressman.txt', encoding='utf8') as fin:
        textTest = fin.read()
if os.path.isfile('AaronPressmanEdited.txt'):
    with io.open('AaronPressmanEdited.txt', encoding='utf8') as fin:
        text = fin.read()

# Tokenize the text.
tokenized_text = [list(map(str.lower, word_tokenize(sent))) 
                for sent in sent_tokenize(text)]

from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.lm import Laplace

n = 1
padded_bigrams = list(pad_both_ends(word_tokenize(textTest), n=1))
trainTest = everygrams(padded_bigrams, min_len = n , max_len=n);

train_data, padded_sents = padded_everygram_pipeline(n, tokenized_text)


model = Laplace(n) 
model.fit(train_data, padded_sents)

print(model.perplexity(trainTest)) 

When I run this code with n=1 (unigrams), I get "1068.332393940235". With n=2 (bigrams) I get "1644.3441077259993", and with trigrams I get 2552.2085752565313.

What is wrong with it?

【Comments】:

    Tags: python machine-learning nltk


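    【Solution 1】:

    You are building the test data incorrectly: the training data is lowercased but the test data is not, and the test data is missing the start and end tokens. Try this: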

    import os
    import requests
    import io #codecs
    from nltk.util import everygrams
    from nltk.lm.preprocessing import pad_both_ends
    from nltk.lm.preprocessing import padded_everygram_pipeline
    from nltk.lm import Laplace
    from nltk import word_tokenize, sent_tokenize 
    
    """
    fileTest = open("AaronPressman.txt","r");
    with io.open('AaronPressman.txt', encoding='utf8') as fin:
            textTest = fin.read()
    if os.path.isfile('AaronPressmanEdited.txt'):
        with io.open('AaronPressmanEdited.txt', encoding='utf8') as fin:
            text = fin.read()
    """
    textTest = "This is an ant. This is a cat"
    text = "This is an orange. This is a mango"
    
    n = 2
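    # Tokenize and lowercase the training text, then build the padded
    # everygrams and the vocabulary stream from it.
    tokenized_text = [list(map(str.lower, word_tokenize(sent))) 
                    for sent in sent_tokenize(text)]
    train_data, padded_sents = padded_everygram_pipeline(n, tokenized_text)
    
    # Apply the same preprocessing to the test text. The second return value
    # is discarded so the training vocabulary above is not overwritten.
    tokenized_test = [list(map(str.lower, word_tokenize(sent))) 
                    for sent in sent_tokenize(textTest)]
    test_data, _ = padded_everygram_pipeline(n, tokenized_test)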
    
    model = Laplace(1) 
    model.fit(train_data, padded_sents)
    
    s = 0
    for i, test in enumerate(test_data):
        p = model.perplexity(test)
        s += p
    
    print ("Perplexity: {0}".format(s/(i+1)))
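
The same idea can be sketched in a compact form. This is only an illustration under a few assumptions of mine, not the answerer's code: the sentences are already split and lowercased by hand (so no tokenizer data is needed), the model is built as `Laplace(n)` because, as far as I can tell from the `nltk.lm` source, the first positional argument of `Laplace` is the model order and the smoothing constant is fixed at 1, and only the highest-order n-grams of each padded test sentence are scored.

```python
from nltk.lm import Laplace
from nltk.lm.preprocessing import pad_both_ends, padded_everygram_pipeline
from nltk.util import ngrams

# Toy data, pre-tokenized and lowercased (assumption: stands in for the real files).
train_sents = [["this", "is", "an", "orange", "."], ["this", "is", "a", "mango"]]
test_sents = [["this", "is", "an", "ant", "."], ["this", "is", "a", "cat"]]

n = 2

# Training n-grams and vocabulary both come from the training sentences only.
train_data, vocab = padded_everygram_pipeline(n, train_sents)
model = Laplace(n)  # order n, add-one smoothing
model.fit(train_data, vocab)

# Pad each test sentence with the same n, then keep only the order-n n-grams.
test_ngrams = [
    ng
    for sent in test_sents
    for ng in ngrams(pad_both_ends(sent, n=n), n)
]

ppl = model.perplexity(test_ngrams)
print("order:", model.order, "gamma:", model.gamma, "perplexity:", ppl)
```

Words that never appear in training, such as "ant" and "cat" here, are mapped to the vocabulary's unknown token, and add-one smoothing keeps their probability above zero, so the perplexity stays finite.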
    

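    【Discussion】:

    • Thanks, but I think you missed one small thing: you should pass n to the Laplace function, right?
    • Can you explain why you sum up all the perplexities? Is that correct?
    • The value passed to Laplace is the smoothing parameter, usually > 0. It has nothing to do with n (unigram, bigram, or n-gram). We compute the perplexity of each test sentence (because the perplexity method accepts only a single generator, not a list of generators) and average them at the end (as you can see in the print statement). So perplexity = s/(i+1).
    • OK, so if I want to use MLE, for example, I have to pass n, right?
    • MLE must be passed the highest n-gram order (i.e. n in your case).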
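
On the averaging question raised above: the mean of per-sentence perplexities is in general a different number from the perplexity of the whole test set, which pools the log-probabilities of all n-grams before exponentiating. A minimal sketch in plain Python, using made-up probabilities (hypothetical values, not produced by NLTK), shows the gap:

```python
import math

def perplexity(probs):
    """Return 2 ** (negative mean log2 probability) for a list of n-gram probabilities."""
    return 2 ** (-sum(math.log2(p) for p in probs) / len(probs))

# Hypothetical per-n-gram probabilities for two test sentences.
sent_probs = [[0.5, 0.25], [0.125, 0.125, 0.5, 0.5]]

# Option 1: average the perplexity of each sentence (what the loop in the answer does).
mean_of_sentence_ppl = sum(perplexity(p) for p in sent_probs) / len(sent_probs)

# Option 2: pool every n-gram and compute a single corpus-level perplexity.
corpus_ppl = perplexity([p for sent in sent_probs for p in sent])

print(mean_of_sentence_ppl, corpus_ppl)
```

The two values differ because longer sentences contribute more n-grams to the pooled figure, while the per-sentence mean weights every sentence equally. Which one to report depends on what is being compared, so it is worth stating the choice explicitly.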