NLTK NaiveBayesClassifier input formatting: answers

【Question title】: NLTK NaiveBayesClassifier input formatting
【Posted】: 2014-08-06 18:48:28
【Question】:

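I'm completely stumped by this one. I'm fairly new to Python and NLTK. I'm trying to build a naive Bayes classifier, but I'm not sure whether the input should be a list of tuples, a dictionary, or a list of tuples that each hold two lists.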

The following input raises AttributeError: 'str' object has no attribute 'items'

[('maggie: just a push button. and the electric car uses sensors to drive itself. \n', 'notending')]

The following format raises AttributeError: 'list' object has no attribute 'items'

[([['the', 'fire', 'chief', 'says', 'someone', 'started', 'the', 'blaze', 'on', 'purpose', 'as', 'a', 'controlled', 'burn', ',', 'but', 'it', 'quickly', 'got', 'out', 'of', 'hand', '.']], 'notending')]

If I use a dictionary, I get ValueError: too many values to unpack

{'everyone: bye!': 'ending'}

I call the naive Bayes classifier like this: classifier = nltk.NaiveBayesClassifier.train(d_train)

I'm not sure what is wrong here. Any help would be much appreciated. Thanks.

【Question comments】:

Tags: python nltk

【Solution 1】:

from nltk.classify import NaiveBayesClassifier
from nltk.corpus import stopwords

# Requires the stopwords corpus: nltk.download('stopwords')
# A set gives fast membership tests
stopset = set(stopwords.words('english'))

def word_feats(words):
    # Map every non-stop word of the sentence to True (a bag-of-words feature dict)
    return {word: True for word in words.split() if word not in stopset}

posids = ['I love this sandwich.', 'I feel very good about these beers.']
negids = ['I hate this sandwich.', 'I feel worst about these beers.']
# Each training example is a (feature dict, label) tuple
pos_feats = [(word_feats(f), 'positive') for f in posids]
neg_feats = [(word_feats(f), 'negative') for f in negids]
print(pos_feats)
print(neg_feats)
# train() expects one flat list of (feature dict, label) tuples
trainfeats = pos_feats + neg_feats
classifier = NaiveBayesClassifier.train(trainfeats)

Take a look at the positive and negative feature sets (pos_feats and neg_feats):

[({'I': True, 'love': True, 'sandwich.': True}, 'positive'), ({'I': True, 'feel': True, 'good': True, 'beers.': True}, 'positive')]
[({'I': True, 'hate': True, 'sandwich.': True}, 'negative'), ({'I': True, 'feel': True, 'beers.': True, 'worst': True}, 'negative')]

So, if you give it the sentence "I hate everything" to classify

print(classifier.classify(word_feats('I hate everything')))

you will get "negative" as the result.
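
The three errors in the question all appear to come from the shape of the training data: train() iterates over (feature set, label) pairs and calls .items() on each feature set, so a string or a list in that position fails, and a plain dict cannot be unpacked into pairs at all. Below is a minimal sketch of a shape check you could run before training. The helper name check_training_data is hypothetical and not part of NLTK.

```python
def check_training_data(data):
    """Return a list of problems; an empty list means the shape looks right.

    Hypothetical helper, not an NLTK function. It only checks that `data`
    is a list of (dict, label) tuples.
    """
    if not isinstance(data, list):
        return ['expected a list of (featureset, label) pairs, got a %s'
                % type(data).__name__]
    problems = []
    for i, item in enumerate(data):
        if not (isinstance(item, tuple) and len(item) == 2):
            problems.append('item %d is not a (featureset, label) pair' % i)
            continue
        featureset, label = item
        if not isinstance(featureset, dict):
            problems.append('item %d: featureset is a %s, not a dict'
                            % (i, type(featureset).__name__))
    return problems


# The three inputs from the question, followed by the format the answer uses.
raw_string = [('maggie: just a push button.', 'notending')]
token_list = [([['the', 'fire', 'chief']], 'notending')]
plain_dict = {'everyone: bye!': 'ending'}
good = [({'I': True, 'love': True, 'sandwich.': True}, 'positive')]

for candidate in (raw_string, token_list, plain_dict, good):
    print(check_training_data(candidate))
```

Only the last candidate passes, because its first tuple element is a dict that maps feature names to values.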

【Discussion】:

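Thanks, this seems to work fine. In this case every word is given the value True. If I want the classifier to use sentences instead of words, do you know how I would do that? And how would I train for the non-True case?
If you want to use sentences instead of words, you can use def word_feats(words): return dict([(words, True)])
In general, we remove stop words (and, or, that, etc.) during classification, which leaves the keywords that belong to the category. If you give stop words the value False, those values will not carry weight during classification.
Thanks. Will try it. Much appreciated. I'm still unsure, though. Suppose I have positive and negative features for training. I label the positive ones True and the negative ones False. I can still train and use a test set to look at some statistics. Now say I get a new set of lines: how do I classify them? Since there is only one method each for positive_word_feats and negative_word_feats, how do I handle the new data?
Providing keywords for a category is better than providing the sentences themselves.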
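
On the question of new data: one way to read the thread is that new, unlabeled lines simply go through the same feature extractor used for training and are then passed to the classifier; labels are only needed for training and testing. The sketch below assumes a tiny hard-coded stop list (STOPSET, hypothetical) in place of the NLTK stopwords corpus so that it runs without downloading anything; the training sentences are the ones from the answer.

```python
from nltk.classify import NaiveBayesClassifier

# Hypothetical mini stop list so the sketch runs without downloading a corpus.
STOPSET = {'this', 'very', 'about', 'these'}

def word_feats(sentence):
    """Bag-of-words features: every non-stop word maps to True."""
    return {word: True for word in sentence.split() if word not in STOPSET}

train = [
    (word_feats('I love this sandwich.'), 'positive'),
    (word_feats('I feel very good about these beers.'), 'positive'),
    (word_feats('I hate this sandwich.'), 'negative'),
    (word_feats('I feel worst about these beers.'), 'negative'),
]
classifier = NaiveBayesClassifier.train(train)

# New, unlabeled lines use the same extractor; no label is attached.
new_lines = ['I love these beers.', 'I hate everything']
new_feats = [word_feats(line) for line in new_lines]
predictions = classifier.classify_many(new_feats)

for line, label in zip(new_lines, predictions):
    # prob_classify exposes the probability behind each decision
    confidence = classifier.prob_classify(word_feats(line)).prob(label)
    print('%-22s -> %s (p=%.2f)' % (line, label, confidence))
```

The same extractor serves both classes, so there is no separate positive or negative method to choose between when new lines arrive. A whole-sentence feature such as {sentence: True} would only match sentences seen verbatim during training, which is consistent with the advice above to prefer keywords.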