R random forest - training set using target column for prediction (answers)

【Question title】: R random forest - training set using target column for prediction
【Posted】: 2014-08-04 01:52:50
【Question】:

I'm learning how to use the various random forest packages and put together the following from example code:

library(party)
library(randomForest)

set.seed(415)

#I'll try to reproduce this with a public data set; in the mean time here's the existing code
data = read.csv(data_location, sep = ',')
test = data[1:65]  #basically data w/o the "answers"

m = sample(1:(nrow(factor)),nrow(factor)/2,replace=FALSE)
o = sample(1:(nrow(data)),nrow(data)/2,replace=FALSE)

train2 = data[m,]
train3 = data[o,]

#random forest implementation
fit.rf <- randomForest(train2[,66] ~., data=train2, importance=TRUE, ntree=10000)
Prediction.rf <- predict(fit.rf, test) #to see if the predictions are accurate -- but it errors out unless I give it all data[1:66]

#cforest implementation
fit.cf <- cforest(train3[,66]~., data=train3, controls=cforest_unbiased(ntree=10000, mtry=10))
Prediction.cf <- predict(fit.cf, test, OOB=TRUE) #to see if the predictions are accurate -- but it errors out unless I give it all data[1:66]

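data[,66] is the target factor I'm trying to predict, but it seems that using "~ ." to solve for it causes the formula to use that factor in the prediction model itself.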

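How do I solve for the dimension I want on high-ish dimensional data without having to spell out exactly which dimensions to use in the formula (so I don't end up with something like cforest(data[,66] ~ data[,1] + data[,2] + data[,3] + ...))?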

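EDIT: At a high level, I believe one basically does the following:

Load the full data.
Break it into several subsets to prevent overfitting.
Train on the subset data.
Generate a fitted model that can then predict the target value (in my case data[,66]) given data[1:65].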

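So my problem now is that if I give it a new set of test data, say test = data[1:65], it fails with "Error in eval(expr, envir, enclos) :" at the point where it expects data[,66]. I basically want to predict data[,66] given the rest of the data!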

【Comments】:

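I don't see a documented function named cforest in library(randomForest). Is that the correct package?
Does data have column names? What is train3? Does train3 only have covariates? From your example it seems that data has all the variables, so that is probably what should go in the data= parameter. This is why it's best to provide a reproducible example.
@MrFlick - oops, the party package.
The data has column names, but does using them help? Since it's a high (well, 60+) dimensional vector, I didn't spell out the columns with c(col) on import, but I did do some preprocessing to make sure every dimension can be represented in numeric format. train3 is the training set, a randomly selected 50% subset of the data. (Thanks for the edit.)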

Tags: r random-forest

【Solution 1】:

I think that if the response is in train3, it will be used as a feature.

I believe this is closer to what you want:

library(party)

# Drop the response (column 5) from the data so that "." expands to predictors only
ctrl <- cforest_unbiased(ntree=1000, mtry=3)
mod <- cforest(iris[,5] ~ ., data = iris[,-5], controls=ctrl)
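
The claim that the response gets used as a feature can be checked with base R alone, without fitting any forest. The sketch below uses iris as a stand-in for the asker's data (column 5 plays the role of column 66) and inspects how "." is expanded in each style of formula. The reformulate() line is an illustrative alternative that is not part of the original answer: it builds a formula from the response's column name, so no predictor has to be spelled out.

```r
# How "." expands when the response is written as an indexing expression.
# iris[, 5] is not recognised as the column "Species", so Species stays in ".".
labels_indexed <- attr(terms(iris[, 5] ~ ., data = iris), "term.labels")

# How "." expands when the response is referred to by its column name.
labels_named <- attr(terms(Species ~ ., data = iris), "term.labels")

# Assumption for illustration: build the formula from the name of the target
# column (here column 5; it would be column 66 in the question's data).
target_name <- names(iris)[5]
f <- reformulate(".", response = target_name)
labels_built <- attr(terms(f, data = iris), "term.labels")

print(labels_indexed)  # includes "Species"
print(labels_named)    # the four measurement columns only
print(f)
```

With the named form, the fitted model only knows about the predictor columns, which is why a test set without the target column can then be passed to predict().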

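【Discussion】:

Oops, I changed it around and elaborated a bit.
My understanding is that if you pass cforest or randomForest a data set with 66 columns, it will fit the model to all of them and expect new data to have 66 columns as well. When you fit the model you appear to be including the response as a feature, which is why it wants 66 columns and why predict doesn't work on a data frame that has only 65.
Doh - so how should I train using the solutions, yet not have those solutions included in my set of predictions?
With randomForest you can build a feature matrix x and a response y, then fit the model via randomForest(x = x, y = y, ...); cforest doesn't appear to have this option, so you would probably need to do what I did in my original answer: cforest(dat[,66] ~ ., data = dat[,-66]).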
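
The randomForest(x = x, y = y, ...) route from the last comment could look roughly like the following. This is a minimal sketch that assumes iris in place of the asker's 66-column data and a smaller ntree than in the question; the variable names (train_idx, x_train and so on) are made up for illustration.

```r
library(randomForest)

set.seed(415)

# Random 50% split, mirroring the question's sampling approach
train_idx <- sample(nrow(iris), nrow(iris) / 2)

# Predictors and response are kept separate, so the response cannot leak into the features
x_train <- iris[train_idx, -5]
y_train <- iris[train_idx, 5]
x_test  <- iris[-train_idx, -5]
y_test  <- iris[-train_idx, 5]

fit <- randomForest(x = x_train, y = y_train, ntree = 500, importance = TRUE)

# The test set has only the predictor columns, which is all the model expects
pred <- predict(fit, x_test)

# Share of held-out rows classified correctly
print(mean(pred == y_test))
```

Because the model was fitted on the predictor columns only, predict() accepts a data frame that lacks the target column, which is the situation described in the question.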