naiveBayes using a word matrix and 3+ classes for prediction

【Question title】: naiveBayes using a word matrix and 3+ classes for prediction
【Posted】: 2013-12-01 15:38:03
【Question】:

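I'm having a hard time understanding A) the output of naiveBayes and B) the predict() function for naiveBayes.

This is not my data, but it is a fun example that shows what I'm trying to do and the errors I'm running into: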

require(RTextTools)
require(e1071)   # provides naiveBayes()
require(useful)

script <- data.frame(lines=c("Rufus, Brint, and Meekus were like brothers to me. And when I say brother, I don't mean, like, an actual brother, but I mean it like the way black people use it. Which is more meaningful I think","If there is anything that this horrible tragedy can teach us, it's that a male model's life is a precious, precious commodity. Just because we have chiseled abs and stunning features, it doesn't mean that we too can't not die in a freak gasoline fight accident",
                         "Why do you hate models, Matilda","What is this? A center for ants? How can we be expected to teach children to learn how to read... if they can't even fit inside the building?","Look, I think I know what this is about and I'm complimented but not interested.",
                         "Hi Derek! My name's Little Cletus and I'm here to tell you a few things about child labor laws, ok? They're silly and outdated. Why back in the 30s, children as young as five could work as they pleased; from textile factories to iron smelts. Yippee! Hurray!","Todd, are you not aware that I get farty and bloated with a foamy latte?","Oh, I'm sorry, did my pin get in the way of your ass? Do me a favor and lose five pounds immediately or get out of my building like now!",
                         "It's that damn Hansel! He's so hot right now!","Obey my dog!",
                         "I hear words like beauty and handsomness and incredibly chiseled features and for me that's like a vanity of self absorption that I try to steer clear of.","Yeah, you're cool to hide here, but first me and him got to straighten some shit out.",
                         "I wasn't like every other kid, you know, who dreams about being an astronaut, I was always more interested in what bark was made out of on a tree. Richard Gere's a real hero of mine. Sting. Sting would be another person who's a hero. The music he's created over the years, I don't really listen to it, but the fact that he's making it, I respect that. I care desperately about what I do. Do I know what product I'm selling? No. Do I know what I'm doing today? No. But I'm here, and I'm gonna give it my best shot.","I totally agree with you. But how do you feel about male models?",
                         "So I'm rappelling down Mount Vesuvius when suddenly I slip, and I start to fall. Just falling, ahh ahh, I'll never forget the terror. When suddenly I realize Holy shit, Hansel, haven't you been smoking Peyote for six straight days, and couldn't some of this maybe be in your head?"))

people <- as.factor(c("Zoolander","Zoolander","Zoolander","Zoolander","Zoolander",
                         "Mugatu","Mugatu","Mugatu","Mugatu","Mugatu",
                         "Hansel","Hansel","Hansel","Hansel","Hansel"))

script.doc.matrix <- create_matrix(script$lines,language = "english",removeNumbers=TRUE, removeStopwords = TRUE, stemWords=FALSE)
script.matrix <- as.matrix(script.doc.matrix)

nb.script <- naiveBayes(script.matrix,people)

nb.predict <- predict(nb.script,script$lines)
nb.predict

My questions:

A) naiveBayes output:

When I run

nb.script$tables

I get tables like this:

$young
           young
people      [,1]   [,2]
  Hansel     0.0 0.0000000
  Mugatu     0.2 0.4472136
  Zoolander  0.0 0.0000000

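How do I interpret this? I thought these were supposed to be probabilities, but I don't understand what the columns [,1] and [,2] mean. Also, shouldn't the probabilities shown in these tables add up to 1.0? Why don't they? It would make sense if there were a third column; should there be one?

Should I be using type=raw in naiveBayes()?

B) predict() for naiveBayes:

The output gives me Hansel as the prediction for every entry. I believe this is only because it is the first class alphabetically. In other cases I have tried, if Hansel is listed 4 times, Mugatu 6 times, and Zoolander 5 times, the predict() function ends up giving me Mugatu as the prediction for every entry, because it is listed the most times in the class vector.

EDIT: So, for my question: how can I get predict to give me an actual prediction?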

The prediction output looks like this:

> nb.predict
 [1] Hansel Hansel Hansel Hansel Hansel Hansel Hansel Hansel Hansel Hansel Hansel
[12] Hansel Hansel Hansel Hansel
Levels: Hansel Mugatu Zoolander

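Here is a link to a similar question: R: Naives Bayes classifier bases decision only on a-priori probabilities. However, the answer there didn't really help me much.

Thanks in advance!

【Comments】:

Great example question.

Tags: r machine-learning classification text-mining

【Solution 1】: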

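For the first part of your question: the columns of the matrix script.matrix are numeric. naiveBayes interprets numeric inputs as continuous data from a Gaussian distribution. The tables you see in the output give the sample means (column 1) and standard deviations (column 2) of these numeric variables across the classes of the factor.

What you probably want is for naiveBayes to recognize that your input variables are indicators. A simple way to do this is to convert the entire script.matrix to a character matrix: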
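To make the Gaussian reading concrete, here is a minimal base-R sketch that reproduces the numbers in the $young table by hand. It assumes a hand-built 0/1 vector in which "young" occurs in exactly one Mugatu line; that vector is illustrative and is not extracted from the real document-term matrix.

```r
# Illustrative 0/1 counts for the word "young" (assumption: the word appears
# in one Mugatu line and nowhere else), in the same class order as `people`.
people <- factor(rep(c("Zoolander", "Mugatu", "Hansel"), each = 5))
young  <- c(0, 0, 0, 0, 0,   # Zoolander
            1, 0, 0, 0, 0,   # Mugatu
            0, 0, 0, 0, 0)   # Hansel

# Per-class sample mean and standard deviation: the two parameters of the
# Gaussian that would be fitted to a numeric column.
gauss <- t(sapply(split(young, people),
                  function(v) c(mean = mean(v), sd = sd(v))))
print(gauss)
```

The Mugatu row comes out as mean 0.2 and standard deviation sqrt(0.2), about 0.447, which matches the [,1] and [,2] columns in the question. Since these are distribution parameters rather than probabilities, there is no reason for a row to sum to 1.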

# Convert columns to characters    
script.matrix <- apply(as.matrix(script.doc.matrix),2,as.character)

With this change, after refitting the model on the converted matrix:

> nb.script <- naiveBayes(script.matrix, people)
> nb.script$tables$young
           young
people        0   1
  Hansel    1.0 0.0
  Mugatu    0.8 0.2
  Zoolander 1.0 0.0
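As a cross-check, a table of this shape can be rebuilt in base R as a row-normalised contingency table. This is only a sketch of the idea, not the package's internal code, and it again assumes an illustrative vector in which "young" occurs in one Mugatu line.

```r
# Same illustrative data as a categorical ("0"/"1") variable.
people <- factor(rep(c("Zoolander", "Mugatu", "Hansel"), each = 5))
young  <- as.character(c(0, 0, 0, 0, 0,   # Zoolander
                         1, 0, 0, 0, 0,   # Mugatu
                         0, 0, 0, 0, 0))  # Hansel

# Count occurrences per class, then divide each row by its total so that
# every row is a conditional distribution P(young = value | class).
cond <- prop.table(table(people, young), margin = 1)
print(cond)
```

Each row now sums to 1, which is the behaviour the question expected from the original output.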

To see the predicted classes:

> nb.predict <- predict(nb.script, script.matrix)
> nb.predict
 [1] Zoolander Zoolander Zoolander Zoolander Zoolander Mugatu    Mugatu   
 [8] Mugatu    Mugatu    Mugatu    Hansel    Hansel    Hansel    Hansel   
[15] Hansel   
Levels: Hansel Mugatu Zoolander

To see the raw probabilities from the naiveBayes fit:

predict(nb.script, script.matrix, type='raw')
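On part B of the question, one plausible reading is that when the new data share no columns with the training features, only the class priors enter the decision. That is an assumption about the package's behaviour, though it fits the title of the linked question. The base-R sketch below shows what a prior-only rule would return; prior_only_prediction is a hypothetical helper written for this illustration, not part of any package.

```r
# Hypothetical helper: choose the class with the largest prior probability.
prior_only_prediction <- function(classes) {
  priors <- prop.table(table(classes))   # class frequencies as priors
  names(priors)[which.max(priors)]       # which.max() keeps the first maximum on ties
}

# Equal class sizes, as in the question's example (5 lines per character).
balanced <- factor(rep(c("Zoolander", "Mugatu", "Hansel"), times = c(5, 5, 5)))

# Unequal class sizes, as in the other case described in the question.
unbalanced <- factor(rep(c("Hansel", "Mugatu", "Zoolander"), times = c(4, 6, 5)))

print(prior_only_prediction(balanced))
print(prior_only_prediction(unbalanced))
```

With equal priors the tie goes to the first factor level, which is the alphabetically first name (Hansel); with 4, 6, and 5 lines the most frequent class (Mugatu) wins. Both agree with the two observations reported in the question.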

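【Discussion】:

For the second question, I'm trying to understand why predict returns the class that comes first alphabetically. So instead of giving varied predictions such as Zoolander, Hansel, Mugatu, Mugatu, Mugatu, Zoolander, Hansel, Zoolander, etc., the predict() output gives me Hansel as the prediction for every entry, because H
Make sure to run predict on script.matrix (with the columns converted to characters) rather than on the raw script$lines. I updated the answer to return the predicted classes instead of the raw probabilities.
Ah, so that's why! Thank you so much!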