【Question title】: R-Caret: how to build a more efficient model with multiple models and predict new results
【Posted】: 2015-05-22 11:49:00
【Question description】:

My training data set (train) is a data frame with n features and an additional column holding the outcome y. I built 3 individual models, for example:

m1 <- train(y ~ ., data = train, method = "lda")
m2 <- train(y ~ ., data = train, method = "rf")
m3 <- train(y ~ ., data = train, method = "gbm")

With a test data set (test), which of course also contains the outcome y, I can evaluate the quality of these individual models:

pred1 <- predict(m1, newdata = test)
pred2 <- predict(m2, newdata = test)
pred3 <- predict(m3, newdata = test)

If I apply each individual model to the data frame DATA_TO_PREDICT (outcome unknown), which has 5 examples, the output is naturally 5 predictions per model:

predict(m1, DATA_TO_PREDICT)
predict(m2, DATA_TO_PREDICT)
predict(m3, DATA_TO_PREDICT)

Now I want to build a combined model with a random forest, using the caret package:

DF <- data.frame(pred1, pred2, pred3, y = test$y)
MODEL <- train(y ~ ., data = DF, method = "rf")

I can observe that the accuracy of the combined model improves:

predMODEL <- predict(MODEL, DF)
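
To put a number on "accuracy improves", the predictions can be compared with the known outcome. The following is a minimal sketch in base R; the vectors `truth` and `pred` and the helper `accuracy()` are made up for illustration and stand in for `test$y` and a prediction vector such as `pred1` or `predMODEL`. Keep in mind that accuracy measured on DF is optimistic for MODEL, because MODEL was trained on that same data frame.

```r
# Hypothetical toy labels standing in for test$y and a model's predictions
truth <- factor(c("a", "b", "a", "b", "a", "b"))
pred  <- factor(c("a", "b", "b", "b", "a", "a"), levels = levels(truth))

# Share of predictions that match the known outcome
accuracy <- function(pred, truth) mean(pred == truth)

acc <- accuracy(pred, truth)
print(acc)  # 4 of 6 correct
```

caret also provides `confusionMatrix(data, reference)`, which reports accuracy together with per-class statistics.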

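However, if I apply the combined model to DATA_TO_PREDICT (outcome unknown), the output is not just 5 predictions but a huge list of repeated results with far more than 10 entries. I used: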

predict(MODEL, newdata = DATA_TO_PREDICT)

Example:

Here is a concrete example of the wrong output. I want predictions for 4 new observations, but I get dozens of outputs:

library(caret)
library(gbm)
set.seed(10)
library(AppliedPredictiveModeling)
data(AlzheimerDisease)
adData = data.frame(diagnosis,predictors)
inTrain = createDataPartition(adData$diagnosis, p = 3/4)[[1]]
training = adData[ inTrain,]
testing = adData[-inTrain,]

inTEST <- (5:nrow(testing))
test <- testing[inTEST,]
DATA_TO_PREDICT <- testing[-inTEST,]

m1 <- train(diagnosis ~ ., data=training, method="rf")
m2 <- train(diagnosis ~ ., data=training, method="gbm")
m3 <- train(diagnosis ~ ., data=training, method="lda")
p1 <- predict(m1, newdata = test)
p2 <- predict(m2, newdata = test)
p3 <- predict(m3, newdata = test)

DF <- data.frame(p1, p2, p3, diagnosis = test$diagnosis)
MODEL <- train(diagnosis ~ ., data = DF, method = "rf")
predMODEL <- predict(MODEL, DF)

If I then apply the combined model:

pred1 <- predict(m1, DATA_TO_PREDICT)
pred2 <- predict(m2, DATA_TO_PREDICT)
pred3 <- predict(m3, DATA_TO_PREDICT)
DF2 <- data.frame(pred1, pred2, pred3)
predict(MODEL, newdata = DF2) 

Note that DATA_TO_PREDICT has only 4 examples, yet the output is:

  [1] Control Control Control Control Control Control Control Control
  [9] Control Control Control Control Control Control Control Control
 [17] Control Control Control Control Control Control Control Control
 [25] Control Control Control Control Control Control Control Control
 [33] Control Control Control Control Control Control Control Control
 [41] Control Control Control Control Control Control Control Control
 [49] Control Control Control Control Control Control Control Control
 [57] Control Control Control Control Control Control Control Control
 [65] Control Control Control Control Control Control Control Control
 [73] Control Control Control Control Control Control
 Levels: Impaired Control

【Comments】:

    Tags: r machine-learning r-caret


    【Solution 1】:

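    This happens because MODEL was trained on the predictions of the three individual models (pred1, pred2, pred3 for the test data), whereas in the last step DATA_TO_PREDICT, which consists of raw observations, is passed to MODEL. First, the individual models' predictions for DATA_TO_PREDICT must be stored, and these are then used as newdata for MODEL. The columns of that data frame also need the same names (p1, p2, p3) that MODEL saw during training, hence the colnames() call: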

    # (Beginning of the example omitted)
    DF <- data.frame(p1, p2, p3, diagnosis = test$diagnosis)
    # This trains a model with predictions as inputs:
    MODEL <- train(diagnosis ~ ., data = DF, method = "rf")
    
    # This is missing ----------------------
    # To get the inputs for the ensemble model
    # the predictions for DATA_TO_PREDICT are needed
    p1b <- predict(m1, newdata = DATA_TO_PREDICT)
    p2b <- predict(m2, newdata = DATA_TO_PREDICT)
    p3b <- predict(m3, newdata = DATA_TO_PREDICT)
    DFb <- data.frame(p1b, p2b, p3b)
    colnames(DFb) <- c("p1", "p2", "p3")
    #----------------------------------------
    
    predMODEL <- predict(MODEL, DFb)
    # [1] Control Control Control Control 
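
A likely reason the original attempt returned 78 values instead of 4 (78 is the number of rows in test) is that the columns of DF2 were named pred1, pred2, pred3, while MODEL was fit on p1, p2, p3. When a formula-based predict cannot find a variable in newdata, R looks it up in the environment where the formula was created, and the vectors p1, p2, p3 from the test set were still there. This is not confirmed in the thread; the sketch below only reproduces the same mechanism with `lm()` in base R. The names `p1`, `DF`, `new_wrong` and `new_right` are toy stand-ins.

```r
set.seed(1)

# 'p1' lives in the calling environment, like the asker's p1 vector
p1 <- rnorm(20)
DF <- data.frame(p1, y = 2 * p1 + rnorm(20))
fit <- lm(y ~ ., data = DF)

# New data with 4 rows but a column name the model has never seen
new_wrong <- data.frame(pred1 = c(0.1, 0.2, 0.3, 0.4))

# 'p1' is not in new_wrong, so R silently falls back to the 20-element
# vector above (lm would emit a warning here, suppressed for brevity)
out_wrong <- suppressWarnings(predict(fit, newdata = new_wrong))
length(out_wrong)  # 20, not 4

# Renaming the column to match the training data fixes the lookup
new_right <- setNames(new_wrong, "p1")
out_right <- predict(fit, newdata = new_right)
length(out_right)  # 4
```

Building the newdata frame with exactly the training column names, as the colnames() line in the answer does, avoids this fallback.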
    

    【Comments】:
