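Naive Bayes classifier bases decision only on a-priori probabilities

【Question title】: Naive Bayes classifier bases decision only on a-priori probabilities
【Posted】: 2013-08-17 21:53:54
【Question】:

I am trying to classify tweets into three categories (Buy, Hold, Sell) according to their sentiment. I am using R and the package e1071.

I have two data frames: a training set and a set of new tweets whose sentiment needs to be predicted.

Training set data frame: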

   +--------------------------------------------------+
   text                                 | sentiment
   this stock is a good buy             | Buy
   markets crash in tokyo               | Sell
   everybody excited about new products | Hold
   +--------------------------------------------------+

Now I want to train the model using the tweet text trainingset[,2] and the sentiment category trainingset[,4].

classifier<-naiveBayes(trainingset[,2],as.factor(trainingset[,4]), laplace=1)

Looking at the elements of the classifier with

classifier$tables$x

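I find that the conditional probabilities have been calculated: each tweet has different probabilities for Buy, Hold and Sell. So far so good.

However, when I predict the training set: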

predict(classifier, trainingset[,2], type="raw")

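I get a classification based only on the a-priori probabilities, which means every tweet is classified as Hold (because "Hold" has the largest share among the sentiments). So every tweet gets the same probabilities for Buy, Hold and Sell: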

      +--------------------------------------------------+
      Id | Buy  | Hold | Sell
      1  | 0.25 | 0.5  | 0.25
      2  | 0.25 | 0.5  | 0.25
      3  | 0.25 | 0.5  | 0.25
      .. | .... | .... | ....
      N  | 0.25 | 0.5  | 0.25
      +--------------------------------------------------+

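Any ideas what I am doing wrong? I appreciate your help!

Thanks

【Comments】:

Tags: r machine-learning classification text-mining

【Solution 1】:

It looks like you trained the model using whole sentences as input, while it seems that you want to use words as input features.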

Usage:

## S3 method for class 'formula'
naiveBayes(formula, data, laplace = 0, ..., subset, na.action = na.pass)
## Default S3 method:
naiveBayes(x, y, laplace = 0, ...)


## S3 method for class 'naiveBayes'
predict(object, newdata,
  type = c("class", "raw"), threshold = 0.001, ...)

Arguments:

  x: A numeric matrix, or a data frame of categorical and/or
     numeric variables.

  y: Class vector.

In particular, if you train naiveBayes this way:

x <- c("john likes cake", "marry likes cats and john")
y <- as.factor(c("good", "bad")) 
bayes<-naiveBayes( x,y )

you get a classifier that can only recognize these two exact sentences:

Naive Bayes Classifier for Discrete Predictors

Call:
naiveBayes.default(x = x,y = y)

A-priori probabilities:
y
 bad good 
 0.5  0.5 

Conditional probabilities:
      x
y      john likes cake marry likes cats and john
  bad                0                         1
  good               1                         0

To get a word-level classifier, you need to run it with words as input:

x <-            c("john","likes","cake","marry","likes","cats","and","john")
y <- as.factor( c("good","good", "good","bad",  "bad",  "bad", "bad","bad") )
bayes<-naiveBayes( x,y )

and you get:

Naive Bayes Classifier for Discrete Predictors

Call:
naiveBayes.default(x = x,y = y)

A-priori probabilities:
y
 bad good 
 0.625 0.375 

Conditional probabilities:
      x
y            and      cake      cats      john     likes     marry
  bad  0.2000000 0.0000000 0.2000000 0.2000000 0.2000000 0.2000000
  good 0.0000000 0.3333333 0.0000000 0.3333333 0.3333333 0.0000000
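
To make the word-level table above more concrete, here is a minimal sketch of how those numbers combine into a score for a whole sentence under the naive independence assumption: the prior of each class is multiplied by the conditional probability of every word. The priors and conditionals are copied from the output above; `score_sentence` is a hypothetical helper written for illustration, not part of e1071, and it assumes every word of the sentence is in the table.

```r
# Priors and conditional probabilities copied from the output above
prior <- c(bad = 0.625, good = 0.375)
cond <- rbind(
  bad  = c(and = 0.2, cake = 0,   cats = 0.2, john = 0.2, likes = 0.2, marry = 0.2),
  good = c(and = 0,   cake = 1/3, cats = 0,   john = 1/3, likes = 1/3, marry = 0)
)

# Hypothetical helper: posterior over classes for one sentence.
# Assumes every word of the sentence is a column of `cond`.
score_sentence <- function(sentence) {
  words <- unlist(strsplit(sentence, " "))
  # prior times the product of the per-word conditionals, for each class
  joint <- prior * apply(cond[, words, drop = FALSE], 1, prod)
  joint / sum(joint)  # normalize so the class scores sum to 1
}

score_sentence("john likes cake")
# bad = 0, good = 1: "cake" never occurs in the "bad" class
```

Because the table contains hard zeros, a sentence mixing words from both classes (for example "cake cats") would score zero for every class; this is the situation the `laplace` argument of `naiveBayes` is meant to smooth over.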

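In general, R is not well suited to processing NLP data; Python (or at least Java) would be a much better choice.

To convert a sentence into words, you can use the strsplit function: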

unlist(strsplit("john likes cake"," "))
[1] "john"  "likes" "cake"
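
Putting the pieces together, one possible way to keep one row per sentence while still using words as features is to build a word-presence data frame and train on that. The following is only a sketch under assumptions: it assumes e1071 is installed, reuses the two toy sentences from above, and the names `tokens`, `vocab` and `features` are made up for illustration. The same named data frame is passed to `predict`, so that its columns can be matched by name to the trained tables.

```r
library(e1071)  # assumed installed; provides naiveBayes()

# Toy data: the same two sentences used earlier in this answer
sentences <- c("john likes cake", "marry likes cats and john")
labels    <- factor(c("good", "bad"))

# Split every sentence into lower-case words and collect the vocabulary
tokens <- strsplit(tolower(sentences), " ", fixed = TRUE)
vocab  <- sort(unique(unlist(tokens)))

# One factor column per vocabulary word: does the sentence contain it?
features <- as.data.frame(setNames(lapply(vocab, function(w) {
  factor(vapply(tokens, function(tk) w %in% tk, logical(1)),
         levels = c(FALSE, TRUE))
}), vocab))

# Laplace smoothing avoids zero probabilities for unseen word/class pairs
model <- naiveBayes(features, labels, laplace = 1)

# Predict on a data frame with the same column names as the training data
pred_raw   <- predict(model, features, type = "raw")
pred_class <- predict(model, features)
```

With real tweets the same idea applies, only with one row per tweet and a much larger vocabulary.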

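【Comments】:

Re: converting tweets into words, this can also be done easily in R with the tm text-mining package (cran.r-project.org/web/packages/tm/). It has many tools that simplify the process, such as removing stop words (e.g. "the", "it"), handling capitalization, etc. The package has a nice vignette that is worth exploring.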
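
As a rough illustration of the tm route mentioned in this comment, the sketch below lower-cases a few example tweets, drops English stop words and builds a document-term matrix with one row per tweet. It assumes the tm package is installed; the variable names are made up for illustration, and the tweets are the examples from the question.

```r
library(tm)  # assumed installed

tweets <- c("this stock is a good buy",
            "markets crash in tokyo",
            "everybody excited about new products")

# Build a corpus, normalize case and remove English stop words
corpus <- VCorpus(VectorSource(tweets))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("english"))

# One row per tweet, one column per remaining term (term counts)
dtm <- DocumentTermMatrix(corpus)
m   <- as.matrix(dtm)
```

The resulting matrix `m` could then be turned into a data frame of per-word features for the classifier.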