rpart models collapse to zero splits in caret

【Question title】: rpart models collapse to zero splits in caret
【Posted】: 2013-11-10 14:59:47
【Question】:

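I am using rpart to run a regression tree analysis within the caret package, with the oneSE option as the selection function. When I do, I often end up with a model that has zero splits. That suggests that having no model is as good as any model. Should this be happening?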

Here is an example:

library(caret)
library(rpart)   # needed for printcp()

# set training controls: 10-fold CV repeated 100 times, one-SE selection rule
tc <- trainControl(method = "repeatedcv", number = 10, repeats = 100, selectionFunction = "oneSE")

# run the model
mod <- train(yvar ~ ., data = dat, method = "rpart", trControl = tc)

# it runs.....
# look at the cptable of the final model
printcp(mod$finalModel)

Here is the model output:

> mod
No pre-processing
Resampling: Cross-Validation (10 fold, repeated 100 times) 

Summary of sample sizes: 81, 79, 80, 80, 80, 80, ... 

Resampling results across tuning parameters:

  cp      RMSE   Rsquared  RMSE SD  Rsquared SD
  0.0245  0.128  0.207     0.0559   0.23       
  0.0615  0.127  0.226     0.0553   0.241      
  0.224   0.123  0.193     0.0534   0.195      

RMSE was used to select the optimal model using  the one SE rule.
The final value used for the model was cp = 0.224.

Here is the output of printcp:

Variables actually used in tree construction:
character(0)

Root node error: 1.4931/89 = 0.016777

n= 89 

       CP nsplit rel error
1 0.22357      0         1

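However, if I run the model directly in rpart, I can see the larger, unpruned tree that was pruned down to the supposedly more parsimonious model above: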

unpruned = rpart(yvar ~., data=dat)
printcp(unpruned)

Regression tree:
rpart(formula = yvar ~ ., data = dat)

Variables actually used in tree construction:
[1] c.n.ratio Fe.ppm    K.ppm     Mg.ppm    NO3.ppm  

Root node error: 1.4931/89 = 0.016777

n= 89 

    CP nsplit rel error xerror    xstd
1 0.223571      0   1.00000 1.0192 0.37045
2 0.061508      2   0.55286 1.1144 0.33607
3 0.024537      3   0.49135 1.1886 0.38081
4 0.010539      4   0.46681 1.1941 0.38055
5 0.010000      6   0.44574 1.2193 0.38000
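For comparison, the usual way to prune by hand from a cptable can be sketched as follows. This is only an illustration: the built-in `mtcars` data stands in for the asker's `dat`, and the names `fit`, `cp_tab`, and `pruned` are made up for the example. In the table above the smallest xerror (1.0192) is already at nsplit = 0, so this rule would also return the root-only tree for the asker's data.

```r
library(rpart)

set.seed(1)  # xerror comes from rpart's internal cross-validation, so fix the seed
fit <- rpart(mpg ~ ., data = mtcars)   # mtcars is a stand-in for `dat`

# cptable columns: CP, nsplit, rel error, xerror, xstd
cp_tab <- fit$cptable
print(cp_tab)

# pick the row with the lowest cross-validated error and prune back to it
best_row <- which.min(cp_tab[, "xerror"])
pruned <- prune(fit, cp = cp_tab[best_row, "CP"])
print(pruned)
```

If every xerror is above 1, as in the asker's table, the cross-validated error of each split tree is worse than that of the root node alone.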

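Caret [I think] is trying to find the smallest tree whose RMSE is within 1 SE of the model with the lowest RMSE. This is similar to the 1-SE approach advocated by Venables and Ripley. In this case, it seems to get stuck choosing the model with no splits, even though that model has no explanatory power.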

Is this right? Is this OK? It seems there should be a rule that prevents selecting a model with no splits.
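To see what the rule does with these particular numbers, here is a small worked sketch using the (rounded) values printed in the resampling table above. It follows my understanding of the one-SE rule, with the standard error taken as SD / sqrt(number of resamples); caret's internal implementation may differ in detail, and the names `res`, `n_resamples`, and `threshold` are just for this example.

```r
# Resampling summary copied from the caret output above (rounded values)
res <- data.frame(
  cp     = c(0.0245, 0.0615, 0.224),
  RMSE   = c(0.128, 0.127, 0.123),
  RMSESD = c(0.0559, 0.0553, 0.0534)
)
n_resamples <- 10 * 100   # 10 folds repeated 100 times

# order from simplest (largest cp, fewest splits) to most complex
res <- res[order(res$cp, decreasing = TRUE), ]

best      <- which.min(res$RMSE)
se_best   <- res$RMSESD[best] / sqrt(n_resamples)  # standard error of the best RMSE
threshold <- res$RMSE[best] + se_best

# simplest model whose RMSE is within one SE of the best
one_se_pick <- res$cp[which(res$RMSE <= threshold)[1]]
best_pick   <- res$cp[best]

one_se_pick   # 0.224
best_pick     # 0.224
```

Both rules land on cp = 0.224 here, because the no-split model already has the lowest resampled RMSE in the table.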

【Comments】:

Tags: r r-caret rpart cart-analysis

【Solution 1】:

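Try removing selectionFunction="oneSE".

That should identify the depth with the smallest possible error. Doing so carries some potential for "optimization bias", because you are picking the minimum observed RMSE, but I have found it to be small in practice.

Max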
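A minimal sketch of this suggestion, assuming the caret and rpart packages are installed. The simulated data frame `sim` and its columns are hypothetical stand-ins for the asker's `dat`, and the number of repeats is cut to 5 only to keep the run short.

```r
library(caret)
library(rpart)   # for printcp()

set.seed(42)
# Hypothetical stand-in for `dat`: 89 rows, one informative predictor, two noise predictors
sim <- data.frame(x1 = runif(89), x2 = runif(89), x3 = runif(89))
sim$yvar <- ifelse(sim$x1 > 0.5, 1, 0) + rnorm(89, sd = 0.3)

# "best" is the default rule: pick the cp with the lowest resampled RMSE
tc_best <- trainControl(method = "repeatedcv", number = 10, repeats = 5,
                        selectionFunction = "best")
mod_best <- train(yvar ~ ., data = sim, method = "rpart",
                  trControl = tc_best, tuneLength = 5)

mod_best$bestTune              # cp chosen by the "best" rule
printcp(mod_best$finalModel)   # splits kept in the final tree
```

If the root-only candidate also has the lowest resampled RMSE, as in the question's table, "best" and "oneSE" select the same cp.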

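【Discussion】:

Thanks for the reply, Max. Using the selectionFunction="best" option doesn't really seem to resolve the result I'm getting. Perhaps there is another way to ask this: is there a way to have rpart try more surrogate splits initially, so that it doesn't get hung up on the initial split? In some cases I can add other variables to the data set and get a tree model whose initial split is one of the variables from the original pool that failed to generate a tree.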
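For anyone wanting to experiment along these lines, the tree-growing settings live in `rpart.control()`. As far as I know, surrogate splits are used to route observations whose primary split variable is missing and do not change which primary split is chosen, so `minsplit`, `minbucket`, and `cp` are the settings more likely to matter. The sketch below uses `mtcars` as a stand-in for the asker's `dat`; the specific values are arbitrary illustrations, not recommendations.

```r
library(rpart)

# mtcars is a stand-in for `dat`; the values below are illustrative only
loose <- rpart.control(
  minsplit     = 5,      # default 20: minimum observations in a node to attempt a split
  minbucket    = 2,      # minimum observations in any leaf
  cp           = 0.001,  # default 0.01: minimum relative improvement needed to keep a split
  maxcompete   = 10,     # competitor splits retained in the output at each node
  maxsurrogate = 10      # surrogate splits retained, used when the primary variable is missing
)
fit_loose <- rpart(mpg ~ ., data = mtcars, control = loose)

printcp(fit_loose)   # check whether xerror actually improves before trusting the larger tree
```

A larger tree grown this way still has to earn its splits under cross-validation; if xerror stays above the root node's value, the extra splits are not supported by the data.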