How to apply a Naive Bayes model to new data

【Question title】: How to apply Naive Bayes model to new data
【Posted】: 2016-01-02 12:41:08
【Question】:

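I asked a question this morning, but I deleted it and am posting a better-worded version here.

I created my first machine learning model using training and test data. I got back a confusion matrix and looked at some summary statistics.

I would now like to apply the model to new data to make predictions, but I don't know how.

Context: predicting monthly "churn" (cancellations). The target variable is "churned", and it has two possible labels, "churned" and "not churned".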

    head(tdata)
  months_subscription nvk_medium                                org_type     churned
1                  25       none                               Community not churned
2                   7       none                            Sports clubs not churned
3                  28       none                            Sports clubs not churned
4                  18    unknown Religious congregations and communities not churned
5                  15       none              Association - Professional not churned
6                   9       none              Association - Professional not churned

Here is my training and testing code:

 library("klaR")
 library("caret")

# import data
test_data_imp <- read.csv("tdata.csv")

# subset only required vars
# had to remove "revenue" since all churned records are 0 (need last price point)
variables <- c("months_subscription", "nvk_medium", "org_type", "churned")
tdata <- test_data_imp[variables]

#training
rn_train <- sample(nrow(tdata),
                   floor(nrow(tdata)*0.75))
train <- tdata[rn_train,]
test <- tdata[-rn_train,]
model <- NaiveBayes(churned ~., data=train)

# testing
predictions <- predict(model, test)
# confusionMatrix() takes the predicted classes first, then the reference labels
confusionMatrix(predictions$class, test$churned)

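Everything works fine up to this point.

Now I have new data, structured and laid out in the same way as tdata above. How can I apply my model to this new data to make predictions? Intuitively, I am looking for a new column, added with cbind, that contains the predicted class for each record.

I tried this: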

## prediction ##
# import data
data_imp <- read.csv("pdata.csv")
pdata <- data_imp[variables]

actual_predictions <- predict(model, pdata)

#append to data and output (as head by default)
predicted_data <- cbind(pdata, actual_predictions$class)

# output
head(predicted_data)

which throws an error:

actual_predictions <- predict(model, pdata)
Error in object$tables[[v]][, nd] : subscript out of bounds
In addition: Warning messages:
1: In FUN(1:6433[[4L]], ...) :
  Numerical 0 probability for all classes with observation 1
2: In FUN(1:6433[[4L]], ...) :
  Numerical 0 probability for all classes with observation 2
3: In FUN(1:6433[[4L]], ...) :
  Numerical 0 probability for all classes with observation 3

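How can I apply my model to the new data? I would like a new data frame with a new column containing the predicted class.

** Following the comments, here are the head and str of the new data used for prediction **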

head(pdata)
  months_subscription nvk_medium                                org_type     churned
1                  26       none                               Community not churned
2                   8       none                            Sports clubs not churned
3                  30       none                            Sports clubs not churned
4                  19    unknown Religious congregations and communities not churned
5                  16       none              Association - Professional not churned
6                  10       none              Association - Professional not churned
> str(pdata)
'data.frame':   6433 obs. of  4 variables:
 $ months_subscription: int  26 8 30 19 16 10 3 5 14 2 ...
 $ nvk_medium         : Factor w/ 16 levels "cloned","CommunityIcon",..: 9 9 9 16 9 9 9 3 12 9 ...
 $ org_type           : Factor w/ 21 levels "Advocacy and civic activism",..: 8 18 18 14 6 6 11 19 6 8 ...
 $ churned            : Factor w/ 1 level "not churned": 1 1 1 1 1 1 1 1 1 1 ...

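【Comments】:

What does the data in the variable pdata look like? Could you please add the result of head(pdata)?
Hi @tguzella, it is exactly the same as tdata, except that every record says "not churned" (since I want to predict which ones will churn).
Well, given the error, I tend to think the data is not the same as tdata... The error seems to be triggered when handling a feature that is a factor. But if you don't show the data, it is basically impossible to tell what is going wrong.
Hi @tguzella, I was on my phone earlier, so I couldn't add the data. I have now added the head and str of pdata. Any pointers or help are very welcome.

Tags: r naivebayes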

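【Solution 1】:

This is most likely caused by a mismatch between the encoding of the factors in the training data (variable tdata in your case) and in the new data passed to the predict function (variable pdata). Typically, the test data contains factor levels that are not present in the training data. You must enforce consistency in how the features are encoded yourself, because the predict function does not check it. I therefore suggest that you double-check the levels of the features nvk_medium and org_type in both variables.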
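
A minimal sketch of such a check, using base R only. The two data frames below are hypothetical stand-ins for tdata and pdata, and the values in them are made up for illustration. setdiff lists the levels that occur in the new data but not in the training data.

```r
# Hypothetical stand-ins for the training data and the new data
train_df <- data.frame(
  nvk_medium = factor(c("none", "unknown", "none")),
  org_type   = factor(c("Community", "Sports clubs", "Community"))
)
new_df <- data.frame(
  nvk_medium = factor(c("none", "cloned", "unknown")),
  org_type   = factor(c("Community", "Sports clubs", "Sports clubs"))
)

# For each factor column, list the levels seen only in the new data
factor_cols <- names(train_df)[sapply(train_df, is.factor)]
unseen_levels <- lapply(
  setNames(factor_cols, factor_cols),
  function(col) setdiff(levels(new_df[[col]]), levels(train_df[[col]]))
)
print(unseen_levels)

# identical() is a stricter check: same levels in the same order
same_encoding <- sapply(
  factor_cols,
  function(col) identical(levels(new_df[[col]]), levels(train_df[[col]]))
)
print(same_encoding)
```

Any column that reports unseen levels, or for which the stricter check is FALSE, is a candidate for the mismatch described above.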

The error message:

 Error in object$tables[[v]][, nd] : subscript out of bounds

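is raised while evaluating a given feature (the v-th feature) of a data point, where nd is the numeric value of the factor corresponding to that feature. You also get warnings indicating that the posterior probability is zero for all classes for data points ("observations") 1, 2 and 3, but it is not clear whether this is also related to the encoding of the factors.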
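
To illustrate why such a lookup fails, here is a small base R sketch. It does not use klaR's internals: cond_table is a hypothetical stand-in for one per-feature table of conditional probabilities, with one column per factor level seen during training, and the numbers are invented.

```r
# Hypothetical table: rows are classes, columns are the levels seen in training
cond_table <- matrix(
  c(0.7, 0.4, 0.3, 0.6),
  nrow = 2,
  dimnames = list(c("churned", "not churned"), c("none", "unknown"))
)

# A level that was seen in training can be looked up
seen <- cond_table[, "none"]
print(seen)

# A level that was never seen has no column, so the lookup fails
msg <- tryCatch(
  cond_table[, "cloned"],
  error = function(e) conditionMessage(e)
)
print(msg)
```

The same kind of failure occurs with a numeric index that is larger than the number of columns, which is what happens when the new data encodes a factor with more levels than the training data did.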

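To reproduce the error you are seeing, consider the following toy data (from http://amunategui.github.io/binary-outcome-modeling/), which has a set of features somewhat similar to those in your data: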

# Data setup
# From http://amunategui.github.io/binary-outcome-modeling/
titanicDF <- read.csv('http://math.ucdenver.edu/RTutorial/titanic.txt', sep='\t')
titanicDF$Title <- as.factor(ifelse(grepl('Mr ',titanicDF$Name),'Mr',ifelse(grepl('Mrs ',titanicDF$Name),'Mrs',ifelse(grepl('Miss',titanicDF$Name),'Miss','Nothing'))) )
titanicDF$Age[is.na(titanicDF$Age)] <- median(titanicDF$Age, na.rm=T)
titanicDF$Survived <- as.factor(titanicDF$Survived)
titanicDF <- titanicDF[c('PClass', 'Age',    'Sex',   'Title', 'Survived')]

# Separate into training and test data
inds_train <- sample(1:nrow(titanicDF), round(0.5 * nrow(titanicDF)), replace = FALSE)
Data_train <- titanicDF[inds_train, , drop = FALSE]
Data_test <- titanicDF[-inds_train, , drop = FALSE]

with:

> str(Data_train)

'data.frame':   656 obs. of  5 variables:
 $ PClass  : Factor w/ 3 levels "1st","2nd","3rd": 1 3 3 3 1 1 3 3 3 3 ...
 $ Age     : num  35 28 34 28 29 28 28 28 45 28 ...
 $ Sex     : Factor w/ 2 levels "female","male": 2 2 2 1 2 1 1 2 1 2 ...
 $ Title   : Factor w/ 4 levels "Miss","Mr","Mrs",..: 2 2 2 1 2 4 3 2 3 2 ...
 $ Survived: Factor w/ 2 levels "0","1": 2 1 1 1 1 2 1 1 2 1 ...

> str(Data_test)

'data.frame':   657 obs. of  5 variables:
 $ PClass  : Factor w/ 3 levels "1st","2nd","3rd": 1 1 1 1 1 1 1 1 1 1 ...
 $ Age     : num  47 63 39 58 19 28 50 37 25 39 ...
 $ Sex     : Factor w/ 2 levels "female","male": 2 1 2 1 1 2 1 2 2 2 ...
 $ Title   : Factor w/ 4 levels "Miss","Mr","Mrs",..: 2 1 2 3 3 2 3 2 2 2 ...
 $ Survived: Factor w/ 2 levels "0","1": 2 2 1 2 2 1 2 2 2 2 ...

Then everything works as expected:

model <- NaiveBayes(Survived ~ ., data = Data_train)

# This will work
pred_1 <- predict(model, Data_test)

> str(pred_1)
List of 2
$ class    : Factor w/ 2 levels "0","1": 1 2 1 2 2 1 2 1 1 1 ...
..- attr(*, "names")= chr [1:657] "6" "7" "8" "9" ...
$ posterior: num [1:657, 1:2] 0.8352 0.0216 0.8683 0.0204 0.0435 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:657] "6" "7" "8" "9" ...
.. ..$ : chr [1:2] "0" "1"

However, if the encoding is inconsistent, for example:

# Mess things up, by "displacing" the factor values (i.e., 'Nothing' 
# will now be encoded as number 5, which was not present in the 
# training data)
Data_test_2 <- Data_test
Data_test_2$Title <- factor(
    as.character(Data_test_2$Title), 
    levels = c("Dr", "Miss", "Mr", "Mrs", "Nothing")
)

> str(Data_test_2)

'data.frame':   657 obs. of  5 variables:
 $ PClass  : Factor w/ 3 levels "1st","2nd","3rd": 1 1 1 1 1 1 1 1 1 1 ...
 $ Age     : num  47 63 39 58 19 28 50 37 25 39 ...
 $ Sex     : Factor w/ 2 levels "female","male": 2 1 2 1 1 2 1 2 2 2 ...
 $ Title   : Factor w/ 5 levels "Dr","Miss","Mr",..: 3 2 3 4 4 3 4 3 3 3 ...
 $ Survived: Factor w/ 2 levels "0","1": 2 2 1 2 2 1 2 2 2 2 ...

then:

> pred_2 <- predict(model, Data_test_2)
Error in object$tables[[v]][, nd] : subscript out of bounds
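
One way to avoid this, sketched here in base R on hypothetical toy vectors rather than on the Titanic data, is to re-encode each factor in the new data with the levels of the training data before calling predict. This is an illustration under those assumptions, not a complete recipe. Values that were not seen in training become NA, so they can be inspected and handled explicitly.

```r
# Hypothetical training factor and a new-data factor with a different encoding
train_title <- factor(c("Miss", "Mr", "Mrs", "Nothing", "Mr"))
new_title <- factor(
  c("Mr", "Dr", "Nothing", "Miss"),
  levels = c("Dr", "Miss", "Mr", "Mrs", "Nothing")
)

# Re-encode via the character values, using the training levels
new_title_fixed <- factor(as.character(new_title), levels = levels(train_title))

# Same levels, in the same order, as in the training data
print(levels(new_title_fixed))

# Positions of values that were never seen in training
unseen_positions <- which(is.na(new_title_fixed))
print(unseen_positions)
```

Going through as.character matters here: it maps values by label, so "Mr" keeps meaning "Mr" even though its numeric code differs between the two encodings.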

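【Discussion】:

Thank you very much, that makes sense. I looked at medium and org_type and found a long tail of levels with low counts, so I grouped them into higher-level categories, reducing the variance (levels?) to 6. Now everything works as expected! Thanks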
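
The grouping described in this comment could look roughly like the following base R sketch. The threshold min_count, the label "Other" and the toy values are assumptions made for illustration; the comment does not say how the levels were actually grouped.

```r
# Hypothetical factor with a long tail of rare levels
org_type <- factor(c(
  rep("Community", 5), rep("Sports clubs", 4),
  "Advocacy", "Arts", "Youth"
))

min_count <- 2  # assumed cut-off below which a level counts as rare
counts <- table(org_type)
rare <- names(counts)[counts < min_count]

# Replace every rare level with a single catch-all level
grouped <- as.character(org_type)
grouped[grouped %in% rare] <- "Other"
grouped <- factor(grouped)

print(table(grouped))
```

The same mapping has to be applied to the new data as well, so that both data sets end up with identical levels.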