Classification of text using naive Bayes in Python: answers

【Question title】: Classification of text using naive bayes in python
【Posted】: 2018-03-06 10:15:41
【Question】:

I have created a model in which I run naive Bayes to get the expected output.

from textblob.classifiers import NaiveBayesClassifier as NBC
from textblob import TextBlob
training_corpus = [
    ('Agree Completely Agree Strongly Agree Somewhat Disagree Somewhat Disagree Strongly Completely Disagree','TRUE'),
    ('Concerned 2 3 4 5 6 7 - Comfortable','TRUE'),
    ('1 - disagree strongly 2 - disagree somewhat 3 - neither agree nor disagree 4 - agree somewhat 5 - agree strongly','TRUE'),
    ('1 - doesn\'t apply at all 2 3 4 5 6 7 - applies completely','TRUE'),
    ('1 - extremely new and different 2 3 4 5 6 7 - not at all new & different','TRUE'),
    ('1 - extremely relevant 2 3 4 5 6 7 - not at all relevant','TRUE'),
    ('1 - I don\'t want brands to engage with me at all on social media 2 3 4 5 6 7 - I love to engage with brands on social media','TRUE'),
    ('1 - Most Important 2 3 4 5 - Least Important','TRUE'),
    ('pepsi','FALSE'),
    ('coca cola','FALSE'),
    ('hyundai','FALSE'),
    ('Audio quality','FALSE'),
    ('Product features ','FALSE'),
    ('Content ','FALSE')
]
test_corpus = [
    ('1 - Agree Completely 2 - Agree Strongly 3 - Agree Somewhat 4 - Disagree Somewhat 5 - Disagree Strongly 6 - Completely Disagree','TRUE'),
    ('1 - Concerned 2 3 4 5 6 7 - Comfortable','TRUE'),
    ('Content ','FALSE'),
    ('Ease of navigation','FALSE')
]
model = NBC(training_corpus) 
print(model.classify('pepsi'))
print(model.accuracy(test_corpus)*100)

When I run this code it reports 100% accuracy, but it returns FALSE every time. I'm not sure what is wrong, but this is not the expected output.

【Comments】:

Tags: python machine-learning naivebayes

【Answer 1】:

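Your model is fine; the behavior comes from your data and from how the classifier works.
Given the training data you provided, it performs well. Let's test it: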

def test(s):
    # prob_classify returns a probability distribution over the labels
    prob_dist = model.prob_classify(s)
    print("classifying", s)
    print("possibility of being FALSE:", round(prob_dist.prob("FALSE"), 2),
          "possibility of being TRUE:", round(prob_dist.prob("TRUE"), 2))
    print('-'*70)

# Short numeric fragments, words seen in training, and one unseen word
test_cases = ['1', '1 - ', '2', '2 3 4 5', '1- 2 3 4 5', 'pepsi', 'coca', 'BMW']
for tc in test_cases:
    test(tc)

Here is the output, which looks reasonable:

classifying 1
possibility of being FALSE: 1.0 possibility of being TRUE: 0.0
----------------------------------------------------------------------
classifying 1 - 
possibility of being FALSE: 1.0 possibility of being TRUE: 0.0
----------------------------------------------------------------------
classifying 2
possibility of being FALSE: 1.0 possibility of being TRUE: 0.0
----------------------------------------------------------------------
classifying 2 3 4 5
possibility of being FALSE: 0.05 possibility of being TRUE: 0.95
----------------------------------------------------------------------
classifying 1- 2 3 4 5
possibility of being FALSE: 0.0 possibility of being TRUE: 1.0
----------------------------------------------------------------------
classifying pepsi
possibility of being FALSE: 1.0 possibility of being TRUE: 0.0
----------------------------------------------------------------------
classifying coca
possibility of being FALSE: 1.0 possibility of being TRUE: 0.0
----------------------------------------------------------------------
classifying BMW
possibility of being FALSE: 1.0 possibility of being TRUE: 0.0
----------------------------------------------------------------------

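OK, so why does the classifier behave this way? Look at your code: where do you specify how the feature vector is built? Nowhere, so it uses the default function to extract the feature vector, as explained here. (You can also look at the source code.)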
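
As a rough illustration of what that default extractor does, here is a minimal sketch in plain Python. It assumes a whitespace tokenizer and uses the hypothetical helper names `build_vocabulary` and `contains_features`; TextBlob's own `basic_extractor` tokenizes with NLTK, so details will differ. The training list is a three-item subset of the question's corpus.

```python
def build_vocabulary(train_set):
    """Collect every distinct token seen in the training texts."""
    vocab = set()
    for text, _label in train_set:
        # Assumption: plain whitespace split, not TextBlob's tokenizer.
        vocab.update(text.split())
    return vocab


def contains_features(document, vocab):
    """Emit one boolean feature per vocabulary word, present or not."""
    tokens = set(document.split())
    return {"contains({0})".format(word): (word in tokens)
            for word in sorted(vocab)}


# A small subset of the training data from the question.
train = [
    ("1 - Most Important 2 3 4 5 - Least Important", "TRUE"),
    ("pepsi", "FALSE"),
    ("coca cola", "FALSE"),
]

vocab = build_vocabulary(train)
features = contains_features("1", vocab)
present = [name for name, value in features.items() if value]
absent = [name for name, value in features.items() if not value]

print(present)                                # ['contains(1)']
print(len(absent), "features are False")      # 11 features are False
```

Because every vocabulary word becomes a feature, a short input such as `1` carries a single True feature and many False ones, and the False ones also count as evidence. If a different representation is wanted, a custom function can be supplied through the `feature_extractor` argument of `NaiveBayesClassifier`.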

For example, you can display your model's most informative features like this:

model.show_informative_features()


>>> Most Informative Features
             contains(4) = False           FALSE : TRUE   =      5.6 : 1.0
             contains(3) = False           FALSE : TRUE   =      5.6 : 1.0
             contains(5) = False           FALSE : TRUE   =      5.6 : 1.0
             contains(2) = False           FALSE : TRUE   =      5.6 : 1.0
             contains(1) = False           FALSE : TRUE   =      3.3 : 1.0
             contains(7) = False           FALSE : TRUE   =      2.4 : 1.0
             contains(6) = False           FALSE : TRUE   =      2.4 : 1.0
            contains(at) = False           FALSE : TRUE   =      1.9 : 1.0
           contains(all) = False           FALSE : TRUE   =      1.9 : 1.0
           contains(not) = False           FALSE : TRUE   =      1.3 : 1.0
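
The ratios in this table can be reproduced by hand, which also shows why very short inputs end up as FALSE. The sketch below is an approximation for illustration, not TextBlob's implementation. It assumes that the underlying NLTK classifier smooths counts with expected likelihood estimation (0.5 added to each count) and that splitting on whitespace is close enough to TextBlob's tokenizer. The helper names (`smoothed`, `likelihood`, `ratio_when_absent`, `posterior`) are hypothetical.

```python
import math

# Training data copied from the question.
training_corpus = [
    ("Agree Completely Agree Strongly Agree Somewhat Disagree Somewhat Disagree Strongly Completely Disagree", "TRUE"),
    ("Concerned 2 3 4 5 6 7 - Comfortable", "TRUE"),
    ("1 - disagree strongly 2 - disagree somewhat 3 - neither agree nor disagree 4 - agree somewhat 5 - agree strongly", "TRUE"),
    ("1 - doesn't apply at all 2 3 4 5 6 7 - applies completely", "TRUE"),
    ("1 - extremely new and different 2 3 4 5 6 7 - not at all new & different", "TRUE"),
    ("1 - extremely relevant 2 3 4 5 6 7 - not at all relevant", "TRUE"),
    ("1 - I don't want brands to engage with me at all on social media 2 3 4 5 6 7 - I love to engage with brands on social media", "TRUE"),
    ("1 - Most Important 2 3 4 5 - Least Important", "TRUE"),
    ("pepsi", "FALSE"),
    ("coca cola", "FALSE"),
    ("hyundai", "FALSE"),
    ("Audio quality", "FALSE"),
    ("Product features ", "FALSE"),
    ("Content ", "FALSE"),
]
LABELS = ("FALSE", "TRUE")


def tokenize(text):
    # Assumption: whitespace split; TextBlob uses an NLTK tokenizer instead.
    return set(text.split())


DOCS = {label: [tokenize(text) for text, lab in training_corpus if lab == label]
        for label in LABELS}
VOCAB = sorted(set().union(*(tokenize(text) for text, _ in training_corpus)))


def smoothed(count, total, bins=2, gamma=0.5):
    """Expected likelihood estimate: add gamma to every bin (assumed smoothing)."""
    return (count + gamma) / (total + gamma * bins)


def likelihood(word, present, label):
    """Smoothed P(contains(word) == present | label)."""
    docs = DOCS[label]
    matches = sum((word in doc) == present for doc in docs)
    return smoothed(matches, len(docs))


def ratio_when_absent(word):
    """How strongly the absence of `word` favors FALSE over TRUE."""
    return likelihood(word, False, "FALSE") / likelihood(word, False, "TRUE")


def posterior(document):
    """Naive Bayes posterior over labels, using every vocabulary word."""
    doc_tokens = tokenize(document)
    log_scores = {}
    for label in LABELS:
        score = math.log(smoothed(len(DOCS[label]), len(training_corpus),
                                  bins=len(LABELS)))
        for word in VOCAB:
            score += math.log(likelihood(word, word in doc_tokens, label))
        log_scores[label] = score
    top = max(log_scores.values())
    weights = {label: math.exp(s - top) for label, s in log_scores.items()}
    total = sum(weights.values())
    return {label: w / total for label, w in weights.items()}


for word in ("4", "1", "7"):
    print(word, round(ratio_when_absent(word), 1))  # 5.6, 3.3 and 2.4, as in the table

print(posterior("1"))
```

Each digit that is missing from the input multiplies the odds in favor of FALSE, so an input such as `1` alone is labeled FALSE even though the token `1` itself points toward TRUE. The posterior values from this sketch will not match TextBlob's exactly, because the tokenization differs.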

【Comments】:

Thanks, Iman... I'm working on it and will let you know if I run into any problems.