【Posted】: 2014-11-12 04:53:45
【Question】:
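I'm not experienced in the R community, so please point me elsewhere if this is not the appropriate forum...
Long story short, I'm afraid that e1071::naiveBayes favors giving labels in alphabetical order.
In an earlier question here, I noticed some strange behavior with numeric predictors in the e1071 implementation of naive Bayes. I got a more reasonable answer, but some of the probabilities still seemed biased upward.
Can anyone explain why this simulation ends up the way it does? At this point I can only imagine that it is a bug...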
library(e1071)

# get a data frame with numObs rows, and numDistinctLabels possible labels
# each label is randomly drawn from letters a-z
# each label has its own distribution of a numeric variable
# this is normal(i*100, 10), i in 1:numDistinctLabels
# so, if labels are t, m, and q, t is normal(100, 10), m is normal(200, 10), etc
# the idea is that all labels should be predicted just as often
# but it seems that "a" will be predicted most, "b" second, etc
doExperiment = function(numObs, numDistinctLabels){
  possibleLabels = sample(letters, numDistinctLabels, replace = FALSE)
  someFrame = data.frame(
    x = rep(NA, numObs),
    label = rep(NA, numObs)
  )
  numObsPerLabel = numObs / numDistinctLabels
  for(i in seq_along(possibleLabels)){
    label = possibleLabels[i]
    whichAreNA = which(is.na(someFrame$label))
    whichToSet = sample(whichAreNA, numObsPerLabel, replace = FALSE)
    someFrame[whichToSet, "label"] = label
    someFrame[whichToSet, "x"] = rnorm(numObsPerLabel, 100*i, 10)
  }
  # stringsAsFactors = TRUE keeps label a factor on R >= 4.0.0,
  # where the default changed to FALSE
  someFrame = as.data.frame(unclass(someFrame), stringsAsFactors = TRUE)
  fit = e1071::naiveBayes(label ~ x, someFrame)
  # The threshold argument doesn't seem to change the matter...
  someFrame$predictions = predict(fit, someFrame, threshold = 0)
  someFrame
}

# given a labeled frame, return the label that was predicted most
getMostFrequentPrediction = function(labeledFrame){
  names(which.max(sort(table(labeledFrame$predictions))))
}

# run the experiment a few thousand times
mostPredictedClasses = sapply(1:2000, function(x) getMostFrequentPrediction(doExperiment(100, 5)))

# make a bar chart of the most frequently predicted labels
plot(table(mostPredictedClasses))
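One thing that may be worth ruling out before blaming naiveBayes is how getMostFrequentPrediction breaks ties. With class means 100 apart and a standard deviation of 10, the labels are likely to be separated almost perfectly, so every label would be predicted about equally often. The sketch below uses only base R; the tied counts and the helper pickMostFrequent are hypothetical illustrations, not part of the original code.

```r
# Hypothetical counts: five labels, each predicted equally often (a tie)
counts <- table(c("t", "m", "q", "b", "x"))

# table() orders its names alphabetically, sort() keeps tied entries in
# that order, and which.max() returns the position of the FIRST maximum
firstOnTie <- names(which.max(sort(counts)))
print(firstOnTie)

# A tie-aware alternative (an assumption, not the author's code):
# pick uniformly at random among all labels that share the top count
pickMostFrequent <- function(predictions) {
  counts <- table(predictions)
  winners <- names(counts)[counts == max(counts)]
  winners[sample.int(length(winners), 1)]
}

print(pickMostFrequent(c("t", "m", "q", "b", "x")))
```

If ties turn out to be common in the simulation, the alphabetical skew in the bar chart could come from this helper rather than from the classifier.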
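Running the simulation gives a plot like the following:
Giving every label the same normal distribution (i.e. mean 100, standard deviation 10) gives:
Regarding the confusion in the comments:
This may be drifting away from Stack Overflow territory, but anyway... While I would expect the classification to be less lumpy, the standard deviation does a lot to flatten out the pdfs, and if you flatten them enough you can observe that one or two of them actually tend to dominate (red and black in this case).
Too bad we can't exploit the knowledge that the standard deviation is the same for all of them.
If you add just a little noise to the means, the predictions become much more evenly distributed, even though some misclassification remains.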
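The effect described above can be explored without e1071 by computing Gaussian class-conditional densities by hand. This is a minimal sketch under stated assumptions: the seed, the label names, the grid, and the variable names are all hypothetical, and it mimics only the idea of a Gaussian naive Bayes fit with equal priors, not the package's internals.

```r
set.seed(1)

# Hypothetical setup: 5 labels, 20 draws each, all from normal(100, 10)
labels <- c("a", "b", "c", "d", "e")
x <- rnorm(100, mean = 100, sd = 10)
y <- factor(rep(labels, each = 20))

# Per-class sample mean and standard deviation
mu <- tapply(x, y, mean)
sigma <- tapply(x, y, sd)

# Class-conditional densities on a grid; priors are equal, so they cancel
grid <- seq(60, 140, by = 0.5)
dens <- sapply(labels, function(k) dnorm(grid, mu[[k]], sigma[[k]]))

# Label with the highest density at each grid point
winner <- labels[max.col(dens, ties.method = "first")]

# How much of the grid each label wins
print(table(factor(winner, levels = labels)))
```

Because the true distributions are identical, any imbalance in the resulting table comes only from sampling noise in the estimated means and standard deviations, which is one way a couple of labels can end up dominating.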
【Discussion】:
Tags: r machine-learning classification