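[Posted]: 2016-01-02 12:41:08
[Question]:
I asked a question this morning, but I deleted it and am posting it here with better wording.
I built my first machine learning model using training and test data. I produced a confusion matrix and looked at some summary statistics.
I would now like to apply the model to new data to make predictions, but I don't know how.
Background: I am predicting monthly churn (cancellations). The target variable is churned, which has two possible labels, "churned" and "not churned".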
head(tdata)
  months_subscription nvk_medium                                org_type     churned
1                  25       none                               Community not churned
2                   7       none                            Sports clubs not churned
3                  28       none                            Sports clubs not churned
4                  18    unknown Religious congregations and communities not churned
5                  15       none              Association - Professional not churned
6                   9       none              Association - Professional not churned
Here is my training and testing code:
library("klaR")
library("caret")
# import data
test_data_imp <- read.csv("tdata.csv")
# subset only required vars
# had to remove "revenue" since all churned records are 0 (need last price point)
variables <- c("months_subscription", "nvk_medium", "org_type", "churned")
tdata <- test_data_imp[variables]
#training
rn_train <- sample(nrow(tdata),
floor(nrow(tdata)*0.75))
train <- tdata[rn_train,]
test <- tdata[-rn_train,]
model <- NaiveBayes(churned ~., data=train)
# testing
predictions <- predict(model, test)
# caret's confusionMatrix() takes the predictions first, then the reference (actual) labels
confusionMatrix(predictions$class, test$churned)
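Everything works fine so far.
Now I have new data, structured and laid out the same way as tdata above. How do I apply my model to this new data to make predictions? Intuitively, I am looking for a new column, added with cbind, that contains the predicted class for each record.
I tried: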
## prediction ##
# import data
data_imp <- read.csv("pdata.csv")
pdata <- data_imp[variables]
actual_predictions <- predict(model, pdata)
#append to data and output (as head by default)
predicted_data <- cbind(pdata, actual_predictions$class)
# output
head(predicted_data)
This throws an error:
actual_predictions <- predict(model, pdata)
Error in object$tables[[v]][, nd] : subscript out of bounds
In addition: Warning messages:
1: In FUN(1:6433[[4L]], ...) :
Numerical 0 probability for all classes with observation 1
2: In FUN(1:6433[[4L]], ...) :
Numerical 0 probability for all classes with observation 2
3: In FUN(1:6433[[4L]], ...) :
Numerical 0 probability for all classes with observation 3
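How do I apply my model to the new data? I would like a new data frame with an extra column containing the predicted class.
** Following the comments, here are the head and str of the new data used for prediction **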
head(pdata)
  months_subscription nvk_medium                                org_type     churned
1                  26       none                               Community not churned
2                   8       none                            Sports clubs not churned
3                  30       none                            Sports clubs not churned
4                  19    unknown Religious congregations and communities not churned
5                  16       none              Association - Professional not churned
6                  10       none              Association - Professional not churned
> str(pdata)
'data.frame': 6433 obs. of 4 variables:
$ months_subscription: int 26 8 30 19 16 10 3 5 14 2 ...
$ nvk_medium : Factor w/ 16 levels "cloned","CommunityIcon",..: 9 9 9 16 9 9 9 3 12 9 ...
$ org_type : Factor w/ 21 levels "Advocacy and civic activism",..: 8 18 18 14 6 6 11 19 6 8 ...
$ churned : Factor w/ 1 level "not churned": 1 1 1 1 1 1 1 1 1 1 ...
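The str output shows nvk_medium and org_type stored as factors with 16 and 21 levels in pdata. One plausible cause of the "subscript out of bounds" error (an assumption, not something confirmed in this thread) is that pdata contains factor levels that never appeared in train, or encodes them in a different order. The sketch below uses made-up toy data and base R only; the data values and the names unseen and aligned are hypothetical. It lists the levels that exist only in the new data and re-encodes each factor with the training levels.

```r
# Toy stand-ins for the training data and the new data (hypothetical values).
train <- data.frame(
  months_subscription = c(25, 7, 28, 18),
  org_type = factor(c("Community", "Sports clubs", "Sports clubs", "Community"))
)
pdata <- data.frame(
  months_subscription = c(26, 8, 30),
  org_type = factor(c("Community", "Sports clubs", "Youth groups"))
)

# Names of the factor columns in the training data.
factor_cols <- names(train)[sapply(train, is.factor)]

# For each factor, the levels present in the new data but absent from training.
unseen <- lapply(factor_cols, function(col) {
  setdiff(levels(pdata[[col]]), levels(train[[col]]))
})
names(unseen) <- factor_cols
print(unseen)

# Re-encode each factor with the training levels so the integer codes match.
# Values that were never seen in training become NA.
aligned <- pdata
for (col in factor_cols) {
  aligned[[col]] <- factor(as.character(pdata[[col]]),
                           levels = levels(train[[col]]))
}
str(aligned)
```

If unseen is non-empty for any column, those rows need an explicit decision (drop them, map them to an existing level, or retrain with data that covers them) before calling predict on the aligned data frame.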
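[Comments]:
- What does the data in the variable pdata look like? Could you please add the output of head(pdata)?
- Hi @tguzella, it is exactly the same as tdata, except that every value of churned says "not churned" (since I want to predict which ones will churn).
- Well, given the error, I tend to think the data is not the same as tdata... The error seems to be triggered when handling features that are factors. But if you don't show the data, it is basically impossible to tell what is wrong.
- Hi @tguzella, I was on my phone earlier so I couldn't add the data. I have now added the head and str of pdata. Any pointers or help would be very welcome.
Tags: r naivebayes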