Building a RandomForest with caret: answers

【Question title】: Building a RandomForest with caret
【Posted】: 2020-01-16 06:20:18
【Question description】:

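I am trying to follow the steps here to build a RandomForest model in caret. Essentially, they set up the RandomForest, then tune for the best mtry, then the best maxnodes, then the best number of trees. These steps make sense, but wouldn't it be better to search over the interaction of these three factors rather than one at a time?

Second, I understand performing a grid search over mtry and ntrees. But I don't know what to set the minimum node size or the maximum number of nodes to. Is it generally recommended to keep the default nodesize, as shown below?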

library(randomForest)
library(caret)
mtrys<-seq(1,4,1)
ntrees<-c(250, 300, 350, 400, 450, 500, 550, 600, 800, 1000, 2000)
combo_mtrTrees<-data.frame(expand.grid(mtrys, ntrees))
colnames(combo_mtrTrees)<-c('mtrys','ntrees')

tuneGrid <- expand.grid(.mtry = c(1: 4))
for (i in 1:length(ntrees)){
  ntree<-ntrees[i]
  set.seed(65)
  rf_maxtrees <- train(Species~.,
                       data = df,
                       method = "rf",
                       importance=TRUE,
                       metric = "Accuracy",
                       tuneGrid = tuneGrid,
                       trControl = trainControl( method = "cv",
                                                 number=5,
                                                 search = 'grid',
                                                 classProbs = TRUE,
                                                 savePredictions = "final"),
                       ntree = ntree
                       )
  Acc1<-rf_maxtrees$results$Accuracy[rf_maxtrees$results$mtry==1]
  Acc2<-rf_maxtrees$results$Accuracy[rf_maxtrees$results$mtry==2]
  Acc3<-rf_maxtrees$results$Accuracy[rf_maxtrees$results$mtry==3]
  Acc4<-rf_maxtrees$results$Accuracy[rf_maxtrees$results$mtry==4]
  combo_mtrTrees$Acc[combo_mtrTrees$mtrys==1 & combo_mtrTrees$ntrees==ntree]<-Acc1
  combo_mtrTrees$Acc[combo_mtrTrees$mtrys==2 & combo_mtrTrees$ntrees==ntree]<-Acc2
  combo_mtrTrees$Acc[combo_mtrTrees$mtrys==3 & combo_mtrTrees$ntrees==ntree]<-Acc3
  combo_mtrTrees$Acc[combo_mtrTrees$mtrys==4 & combo_mtrTrees$ntrees==ntree]<-Acc4
}

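【Question comments】:

Hi Jack, you need to limit your post to one question at a time. Also, please make sure you are not asking for subjective solutions.

Tags: r random-forest r-caret

【Solution 1】: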

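Yes, it is better to search over the interaction of the parameters.
nodesize and maxnodes are usually left at their defaults, but there is no reason not to tune them. Personally, I would leave maxnodes at its default and perhaps tune nodesize, since it can be viewed as a regularization parameter. To get an idea of which values to try, check the defaults in rf, which are 1 for classification and 5 for regression. So trying 1-10 is an option.
When tuning in a loop as in your example, it is advisable to always use the same cross-validation folds. You can create them with createFolds before running the loop.
After tuning, be sure to evaluate your result on an independent validation set, or perform nested cross validation, where the inner loop is used to tune the parameters and the outer loop to estimate model performance. Results from cross-validation alone will be optimistically biased, because the same folds were used to pick the parameters.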
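The nested cross-validation idea can be sketched roughly as follows. This is only an illustration under assumptions that are not part of the answer: the names outer_folds, inner_ctrl and outer_perf are made up here, the mtry grid is kept tiny to limit run time, and Kappa (via caret's postResample) is used as the outer performance measure.

```r
library(caret)
library(mlbench)
data(Sonar)

# Outer loop: folds used only to estimate performance
set.seed(1234)
outer_folds <- createFolds(Sonar$Class, k = 5, returnTrain = TRUE)

# Inner loop: plain 5-fold CV used only to pick mtry
inner_ctrl <- trainControl(method = "cv", number = 5)

outer_perf <- sapply(outer_folds, function(train_idx) {
  train_set <- Sonar[train_idx, ]
  test_set  <- Sonar[-train_idx, ]
  # Tuning sees only the outer training part
  fit <- train(Class ~ .,
               data = train_set,
               method = "rf",
               metric = "Kappa",
               tuneGrid = expand.grid(.mtry = c(1, 5, 10)),
               trControl = inner_ctrl,
               ntree = 500)
  # Score the tuned model on the held-out outer fold
  postResample(pred = predict(fit, newdata = test_set),
               obs = test_set$Class)
})

# Rows are Accuracy and Kappa, columns are the outer folds
rowMeans(outer_perf)
```

The averaged outer-fold values are the performance estimate; the mtry picked inside each outer fold may differ from fold to fold, which is expected.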
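In most cases accuracy is not an appropriate metric for picking the best classification model, especially with imbalanced data sets. Read about the receiver operating characteristic AUC, Cohen's kappa, the Matthews correlation coefficient, balanced accuracy, the F1 score, and classification threshold tuning.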
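To see why accuracy can mislead on imbalanced data, here is a small worked example in base R. The confusion-matrix counts are made up purely for illustration (10 positives, 90 negatives) and do not come from any data set in this post.

```r
# Toy counts, invented for illustration only
tp <- 5; fn <- 5    # actual positives: 10
fp <- 5; tn <- 85   # actual negatives: 90
n  <- tp + fn + fp + tn

accuracy <- (tp + tn) / n
sens <- tp / (tp + fn)              # sensitivity (true positive rate)
spec <- tn / (tn + fp)              # specificity (true negative rate)
balanced_accuracy <- (sens + spec) / 2

# Cohen's kappa: observed agreement corrected for chance agreement
p_observed <- accuracy
p_expected <- ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / n^2
cohen_kappa <- (p_observed - p_expected) / (1 - p_expected)

round(c(accuracy = accuracy,
        balanced_accuracy = balanced_accuracy,
        kappa = cohen_kappa), 3)
```

With these counts accuracy is 0.9, while balanced accuracy is about 0.722 and kappa about 0.444, because half of the minority class is misclassified.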
Here is an example of how to tune rf parameters jointly. I will use the Sonar data set from the mlbench package.

Create predefined folds:

library(caret) 
library(mlbench)
data(Sonar)

set.seed(1234)
cv_folds <- createFolds(Sonar$Class, k = 5, returnTrain = TRUE)

Create the tune control:

tuneGrid <- expand.grid(.mtry = c(1 : 10))

ctrl <- trainControl(method = "cv",
                     number = 5,
                     search = 'grid',
                     classProbs = TRUE,
                     savePredictions = "final",
                     index = cv_folds,
                     summaryFunction = twoClassSummary) #in most cases a better summary for two class problems

Define the other parameters to tune. I will use just a few combinations to limit training time for the example:

ntrees <- c(500, 1000)    
nodesize <- c(1, 5)

params <- expand.grid(ntrees = ntrees,
                      nodesize = nodesize)

Train:

store_maxnode <- vector("list", nrow(params))
for(i in 1:nrow(params)){
  nodesize <- params[i,2]
  ntree <- params[i,1]
  set.seed(65)
  rf_model <- train(Class~.,
                       data = Sonar,
                       method = "rf",
                       importance=TRUE,
                       metric = "ROC",
                       tuneGrid = tuneGrid,
                       trControl = ctrl,
                       ntree = ntree,
                       nodesize = nodesize)
  store_maxnode[[i]] <- rf_model
  }

################### 26.02.2021.

To avoid generic model names (model1, model2, ...) we can name the resulting list with the corresponding parameters:

names(store_maxnode) <- paste("ntrees:", params$ntrees,
                              "nodesize:", params$nodesize)

################### 26.02.2021.

Combine the results:

results_mtry <- resamples(store_maxnode)

summary(results_mtry)

Output:

Call:
summary.resamples(object = results_mtry)

Models: ntrees: 500 nodesize: 1, ntrees: 1000 nodesize: 1, ntrees: 500 nodesize: 5, ntrees: 1000 nodesize: 5 
Number of resamples: 5 

ROC 
                              Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
ntrees: 500 nodesize: 1  0.9108696 0.9354067 0.9449761 0.9465758 0.9688995 0.9727273    0
ntrees: 1000 nodesize: 1 0.8847826 0.9473684 0.9569378 0.9474828 0.9665072 0.9818182    0
ntrees: 500 nodesize: 5  0.9163043 0.9377990 0.9569378 0.9481652 0.9593301 0.9704545    0
ntrees: 1000 nodesize: 5 0.9000000 0.9342105 0.9521531 0.9462321 0.9641148 0.9806818    0

Sens 
                              Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
ntrees: 500 nodesize: 1  0.9090909 0.9545455 0.9545455 0.9549407 0.9565217 1.0000000    0
ntrees: 1000 nodesize: 1 0.9090909 0.9130435 0.9545455 0.9371542 0.9545455 0.9545455    0
ntrees: 500 nodesize: 5  0.9090909 0.9545455 0.9545455 0.9458498 0.9545455 0.9565217    0
ntrees: 1000 nodesize: 5 0.9090909 0.9545455 0.9545455 0.9458498 0.9545455 0.9565217    0

Spec 
                         Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
ntrees: 500 nodesize: 1  0.65 0.6842105 0.7368421 0.7421053 0.7894737 0.8500000    0
ntrees: 1000 nodesize: 1 0.60 0.6842105 0.7894737 0.7631579 0.8421053 0.9000000    0
ntrees: 500 nodesize: 5  0.55 0.6842105 0.7894737 0.7331579 0.8000000 0.8421053    0
ntrees: 1000 nodesize: 5 0.60 0.6842105 0.7368421 0.7321053 0.7894737 0.8500000    0

Get the best mtry for each model:

lapply(store_maxnode, function(x) x$bestTune)
#output
$`ntrees: 500 nodesize: 1`
  mtry
1    1

$`ntrees: 1000 nodesize: 1`
  mtry
2    2

$`ntrees: 500 nodesize: 5`
  mtry
1    1

$`ntrees: 1000 nodesize: 5`
  mtry
1    1

################### 26.02.2021.
Or get the best average performance for each model:

lapply(store_maxnode, function(x) x$results[x$results$ROC == max(x$results$ROC),])
#output
$`ntrees: 500 nodesize: 1`
  mtry       ROC      Sens      Spec      ROCSD     SensSD    SpecSD
1    1 0.9465758 0.9549407 0.7421053 0.02541895 0.03215337 0.0802308

$`ntrees: 1000 nodesize: 1`
  mtry       ROC      Sens      Spec      ROCSD     SensSD    SpecSD
2    2 0.9474828 0.9371542 0.7631579 0.03728797 0.02385499 0.1209382

$`ntrees: 500 nodesize: 5`
  mtry       ROC      Sens      Spec      ROCSD     SensSD    SpecSD
1    1 0.9481652 0.9458498 0.7331579 0.02133659 0.02056666 0.1177407

$`ntrees: 1000 nodesize: 5`
  mtry       ROC      Sens      Spec      ROCSD     SensSD    SpecSD
1    1 0.9462321 0.9458498 0.7321053 0.03091747 0.02056666 0.0961229

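From this toy example you can see that the highest average (over 5 folds) area under the ROC curve (ROC) is achieved with ntrees: 500, nodesize: 5 and mtry: 1, and it is equal to 0.948.
###################

You can also use the default summary: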
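Picking the winner by eye gets tedious with more combinations. A minimal sketch of doing it programmatically is shown below; best_overall is a hypothetical helper (not a caret function), and the toy list only stands in for store_maxnode, reusing the mtry and ROC values printed above.

```r
# Hypothetical helper: `models` is a named list whose elements each carry a
# `results` data frame, as caret train objects do.
best_overall <- function(models, metric = "ROC") {
  best_rows <- lapply(models, function(m) {
    res <- m$results
    res[which.max(res[[metric]]), , drop = FALSE]  # best row within one model
  })
  out <- do.call(rbind, best_rows)
  out$model <- names(models)
  rownames(out) <- NULL
  out[order(out[[metric]], decreasing = TRUE), ]   # best combination first
}

# Stand-in for store_maxnode, built from the best rows printed above
toy <- list(
  "ntrees: 500 nodesize: 1"  = list(results = data.frame(mtry = 1, ROC = 0.9465758)),
  "ntrees: 1000 nodesize: 1" = list(results = data.frame(mtry = 2, ROC = 0.9474828)),
  "ntrees: 500 nodesize: 5"  = list(results = data.frame(mtry = 1, ROC = 0.9481652)),
  "ntrees: 1000 nodesize: 5" = list(results = data.frame(mtry = 1, ROC = 0.9462321))
)

ranked <- best_overall(toy)
print(ranked)
```

With the real objects, best_overall(store_maxnode) should work the same way, since each train object exposes its cross-validated summary in $results. The first row of the returned data frame is the best ntrees/nodesize/mtry combination.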

ctrl <- trainControl(method = "cv",
                         number = 5,
                         search = 'grid',
                         classProbs = TRUE,
                         savePredictions = "final",
                         index = cv_folds)

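and define metric = "Kappa" in train.

【Comments】:

This goes above and beyond the question!
Thank you for the wonderful answer! Could you tell me how to see the best nodesize and ntrees? I tried summary(resamples(nodesize)). Also, to clarify, do the four models represent four training folds in this 5-fold cross-validation?
Thank you so much for the edit!! Following up on this, I observed that if you just extract the best parameters from rf_model (e.g. via rf_model$finalModel$params), it sometimes picks a nodesize and ntree that, judging visually from your lapply output above, do not actually give the best ROC but an intermediate one. I had always assumed that with metric=ROC, $finalModel would hold the best parameters?
@PleaseHelp For the example above, the params in finalModel match the parameters used in the lapply call: lapply(store_maxnode, function(x) x$finalModel$param). If you have an example where they do not match, please post a new question.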
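A minimal sketch of that variant, assuming the same Sonar data and predefined folds as above. The names ctrl_kappa and rf_kappa and the small mtry grid are choices made here for illustration, not part of the original answer.

```r
library(caret)
library(mlbench)
data(Sonar)

set.seed(1234)
cv_folds <- createFolds(Sonar$Class, k = 5, returnTrain = TRUE)

# No summaryFunction: caret's default summary reports Accuracy and Kappa
ctrl_kappa <- trainControl(method = "cv",
                           number = 5,
                           search = "grid",
                           classProbs = TRUE,
                           savePredictions = "final",
                           index = cv_folds)

set.seed(65)
rf_kappa <- train(Class ~ .,
                  data = Sonar,
                  method = "rf",
                  metric = "Kappa",   # select mtry by Kappa instead of ROC
                  tuneGrid = expand.grid(.mtry = c(1, 5, 10)),
                  trControl = ctrl_kappa,
                  ntree = 500,
                  nodesize = 5)

rf_kappa$results[, c("mtry", "Accuracy", "Kappa")]
rf_kappa$bestTune
```

Here bestTune holds the mtry with the highest cross-validated Kappa; to tune ntree and nodesize as well, this call can be placed in the same kind of loop as in the ROC example.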