How to predict sentiment after training and testing a model with NLTK NaiveBayesClassifier in Python? Answers

【Question title】: How to predict Sentiments after training and testing the model by using NLTK NaiveBayesClassifier in Python?
【Posted】: 2020-03-16 21:34:56
【Question】:

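I am doing sentiment classification with the NLTK NaiveBayesClassifier. I trained and tested the model on labeled data. Now I want to predict the sentiment of unlabeled data, but I get an error. The line that raises the error is: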

score_1 = analyzer.evaluate(list(zip(new_data['Articles'])))

The error is:

ValueError: not enough values to unpack (expected 2, got 1)

The full code is below:

import random
import pandas as pd
data = pd.read_csv("label data for testing .csv", header=0)
sentiment_data = list(zip(data['Articles'], data['Sentiment']))
random.shuffle(sentiment_data)
new_data = pd.read_csv("Japan Data.csv", header=0)
train_x, train_y = zip(*sentiment_data[:350])
test_x, test_y = zip(*sentiment_data[350:])

from unidecode import unidecode
from nltk import word_tokenize
from nltk.classify import NaiveBayesClassifier
from nltk.sentiment import SentimentAnalyzer
from nltk.sentiment.util import extract_unigram_feats

TRAINING_COUNT = 350


def clean_text(text):
    text = text.replace("<br />", " ")

    return text


analyzer = SentimentAnalyzer()
vocabulary = analyzer.all_words([(word_tokenize(unidecode(clean_text(instance))))
                                 for instance in train_x[:TRAINING_COUNT]])
print("Vocabulary: ", len(vocabulary))

print("Computing Unigran Features ...")

unigram_features = analyzer.unigram_word_feats(vocabulary, min_freq=10)

print("Unigram Features: ", len(unigram_features))

analyzer.add_feat_extractor(extract_unigram_feats, unigrams=unigram_features)

# Build the training set
_train_X = analyzer.apply_features([(word_tokenize(unidecode(clean_text(instance))))
                                    for instance in train_x[:TRAINING_COUNT]], labeled=False)

# Build the test set
_test_X = analyzer.apply_features([(word_tokenize(unidecode(clean_text(instance))))
                                   for instance in test_x], labeled=False)

trainer = NaiveBayesClassifier.train
classifier = analyzer.train(trainer, zip(_train_X, train_y[:TRAINING_COUNT]))

score = analyzer.evaluate(list(zip(_test_X, test_y)))
print("Accuracy: ", score['Accuracy'])

score_1 = analyzer.evaluate(list(zip(new_data['Articles'])))
print(score_1)

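I understand the problem arises because the line that raises the error needs two values per item rather than one, but I don't know how to provide them.

Thanks in advance.

【Comments】:

Tags: nltk python-3.7 sentiment-analysis predict naivebayes

【Solution 1】: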

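Documentation and examples

The line that raises the error calls the SentimentAnalyzer.evaluate(...) method. According to the documentation, this method does the following.

Evaluate and print classifier performance on the test set.

See SentimentAnalyzer.evaluate.

The method has one mandatory parameter: test_set.

test_set – A list of (tokens, label) tuples to use as gold set.

In the example at http://www.nltk.org/howto/sentiment.html, test_set has the following structure: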

[({'contains(,)': False, 'contains(.)': True, 'contains(and)': False, 'contains(the)': True}, 'subj'), ({'contains(,)': True, 'contains(.)': True, 'contains(and)': False, 'contains(the)': True}, 'subj'), ...]

Here is a symbolic representation of that structure.

[(dictionary,label), ... , (dictionary,label)]
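As a minimal sketch of how such a structure comes about, the snippet below builds two (dictionary, label) pairs with `extract_unigram_feats`, the same feature extractor used in the question. The token lists and the four-word `unigrams` list are made up for illustration and are not taken from the howto's corpus.

```python
from nltk.sentiment.util import extract_unigram_feats

# Hypothetical unigram features, chosen to mirror the keys shown above.
unigrams = [",", ".", "and", "the"]

# Hypothetical tokenized documents, each paired with a gold label.
labeled_docs = [
    (["the", "film", "ends", "."], "subj"),
    (["yes", ",", "the", "end", "."], "subj"),
]

# Each element becomes (feature dictionary, label), the shape evaluate expects.
test_set = [
    (extract_unigram_feats(tokens, unigrams), label)
    for tokens, label in labeled_docs
]

for features, label in test_set:
    print(features, label)
```

In the question's code, `analyzer.apply_features(..., labeled=False)` produces the dictionaries and `zip(_test_X, test_y)` attaches the labels, which yields the same shape.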

Error in the code

You are passing

list(zip(new_data['Articles']))

to SentimentAnalyzer.evaluate. I assume you get the error because

list(zip(new_data['Articles']))

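does not create a list of (tokens, label) tuples: zip over a single column yields one-element tuples, and there are no labels to pair with them. You can check this by storing the list in a variable and printing it, or by inspecting the variable's value in a debugger. For example: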

test_set = list(zip(new_data['Articles']))
print("begin test_set")
print(test_set)
print("end test_set")
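The effect can be reproduced without NLTK or pandas. The sketch below uses a plain list as a stand-in for `new_data['Articles']` (the two strings are made up) and a loop that unpacks pairs, which is presumably what evaluate does internally with each test_set element.

```python
# Stand-in for new_data['Articles']; the texts are invented for illustration.
articles = ["first article text", "second article text"]

# zip over a single iterable wraps every item in a 1-tuple.
test_set = list(zip(articles))
print(test_set)

# Unpacking each element into two names fails, because there is only one value.
try:
    for tokens, label in test_set:
        pass
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

The message printed at the end matches the one reported in the question.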

You call evaluate correctly 3 lines above the line that raises the error.

score = analyzer.evaluate(list(zip(_test_X, test_y)))

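Since evaluate needs gold labels to compute its scores, it cannot be used on unlabeled data. I guess you want to call SentimentAnalyzer.classify(instance) to predict the sentiment of the unlabeled data. See SentimentAnalyzer.classify.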
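A minimal sketch of that approach follows, under several assumptions: the tiny `labeled` corpus and the `unlabeled` texts are invented stand-ins for the CSV columns, and `str.split` replaces the question's `word_tokenize(unidecode(clean_text(...)))` so the snippet runs without downloading tokenizer data. With real data, the unlabeled texts should go through the same preprocessing as the training texts.

```python
from nltk.classify import NaiveBayesClassifier
from nltk.sentiment import SentimentAnalyzer
from nltk.sentiment.util import extract_unigram_feats

# Invented stand-in for data['Articles'] / data['Sentiment'], already tokenized.
labeled = [
    ("good great excellent".split(), "pos"),
    ("great good fine".split(), "pos"),
    ("bad awful terrible".split(), "neg"),
    ("terrible bad poor".split(), "neg"),
]
# Invented stand-in for new_data['Articles'] (raw, unlabeled texts).
unlabeled = ["good great", "bad terrible"]

analyzer = SentimentAnalyzer()
vocabulary = analyzer.all_words([tokens for tokens, _ in labeled])
unigrams = analyzer.unigram_word_feats(vocabulary)
analyzer.add_feat_extractor(extract_unigram_feats, unigrams=unigrams)

# Train on (features, label) pairs, as in the question.
train_set = analyzer.apply_features(labeled, labeled=True)
analyzer.train(NaiveBayesClassifier.train, train_set)

# classify takes one tokenized document, applies the registered feature
# extractors, and returns the predicted label; no gold label is needed.
predictions = [analyzer.classify(text.split()) for text in unlabeled]

for text, label in zip(unlabeled, predictions):
    print(label, "<-", text)
```

Applied to the question's code, the last line would become something like a list comprehension over `new_data['Articles']` that tokenizes each article and passes it to `analyzer.classify`, and the result could then be stored as a new column.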

【Comments】: