Why does specifying sampsize not speed up randomForest?

【Question title】: Why does specifying sampsize not speed up randomForest?
【Posted】: 2018-08-11 16:35:03
【Question】:

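I am trying to run a random forest regression in R on this large dataset using the randomForest package. Even when running in parallel with doSNOW on 10-20 cores, the computation time is a problem. I think I have misunderstood the "sampsize" argument of the randomForest function. When I subset the dataset to 100,000 rows, I can build one tree in 9-10 seconds.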

library(randomForest)
library(dplyr)  # provides sample_n()
training = read.csv("training.csv")
t100K = sample_n(training, 100000)
system.time(randomForest(tree~., data=t100K, ntree=1, importance=T)) # ~10 sec

However, when I use the sampsize argument to sample 100,000 rows from the full dataset within the call to randomForest, the same single tree takes hours.

system.time(randomForest(tree~., data=training, sampsize = ifelse(nrow(training<100000),nrow(training), 100000), ntree=1, importance=T)) #>>100x as long. Why?

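Obviously, I will eventually be running far more than one tree. What am I missing here? Thanks.

【Comments】:

Tags: r machine-learning regression random-forest sample

【Answer 1】:

Your parentheses are slightly off. Note the difference between the following two statements. You currently have: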

ifelse(nrow(mtcars<10),nrow(mtcars), 10)

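This counts the rows of the logical matrix mtcars<10, which is TRUE for every element of mtcars that is less than 10 and FALSE otherwise. That row count is a nonzero number, so ifelse treats it as TRUE and returns nrow(mtcars); in your call, sampsize therefore ends up equal to the size of the full dataset. You want: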

ifelse(nrow(mtcars)<10,nrow(mtcars), 10)
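To see the difference concretely, here is a minimal sketch that evaluates both expressions on the built-in mtcars data (32 rows). The variable names wrong, right, and simpler are only illustrative, and the min() form is an alternative suggested here, not part of the original answer.

```r
# mtcars ships with R and has 32 rows; the threshold 10 mirrors the example above.

# Misplaced parenthesis: nrow() is applied to the logical matrix (mtcars < 10),
# so the test is the number 32, which ifelse() coerces to TRUE.
wrong <- ifelse(nrow(mtcars < 10), nrow(mtcars), 10)

# Intended form: compare the row count with the threshold first.
right <- ifelse(nrow(mtcars) < 10, nrow(mtcars), 10)

# Assumption: an equivalent and arguably clearer way to cap the sample size.
simpler <- min(nrow(mtcars), 10)

print(c(wrong = wrong, right = right, simpler = simpler))
```

With the misplaced parenthesis the result is the full row count, so passing it as sampsize would not reduce the sample at all.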

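Hope this helps.

【Comments】:

Ah. Thanks. Interestingly, with the parentheses placed correctly, the second case becomes faster than the first as the number of trees increases.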