Confusion matrix of bstTree predictions, error: 'The data must contain some levels that overlap the reference.' (answers)

【Question title】: confusion matrix of bstTree predictions, Error: 'The data must contain some levels that overlap the reference.'
【Posted】: 2016-12-14 06:06:42
【Question】:

I am trying to train a model with the bstTree method and print the confusion matrix. adverse_effects is my class attribute.

set.seed(1234)
splitIndex <- createDataPartition(attended_num_new_bstTree$adverse_effects, p = .80, list = FALSE, times = 1)
trainSplit <- attended_num_new_bstTree[ splitIndex,]
testSplit <- attended_num_new_bstTree[-splitIndex,]

ctrl <- trainControl(method = "cv", number = 5)
model_bstTree <- train(adverse_effects ~ ., data = trainSplit, method = "bstTree", trControl = ctrl)


predictors <- names(trainSplit)[names(trainSplit) != 'adverse_effects']
pred_bstTree <- predict(model_bstTree$finalModel, testSplit[,predictors])


plot.roc(auc_bstTree)

conf_bstTree= confusionMatrix(pred_bstTree,testSplit$adverse_effects)

But I get the error "Error in confusionMatrix.default(pred_bstTree, testSplit$adverse_effects) : The data must contain some levels that overlap the reference."

 max(pred_bstTree)
[1] 1.03385
 min(pred_bstTree)
[1] 1.011738

> unique(trainSplit$adverse_effects)
[1] 0 1
Levels: 0 1

How can I fix this?

> head(trainSplit)
   type New_missed Therapytypename New_Diesease gender adverse_effects change_in_exposure other_reasons other_medication
5     2          1              14           13      2               0                  0             0                0
7     2          0              14           13      2               0                  0             0                0
8     2          0              14           13      2               0                  0             0                0
9     2          0              14           13      2               1                  0             0                0
11    2          1              14           13      2               0                  0             0                0
12    2          0              14           13      2               0                  0             0                0
   uvb_puva_type missed_prev_dose skintypeA skintypeB Age DoseB DoseA
5              5                1         1         1  22 3.000     0
7              5                0         1         1  22 4.320     0
8              5                0         1         1  22 4.752     0
9              5                0         1         1  22 5.000     0
11             5                1         1         1  22 5.000     0
12             5                0         1         1  22 5.000     0

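【Comments】:

It looks like you are predicting a regression rather than a classification. Check whether adverse_effects is set as a factor in your data.
Yes, it is a factor containing 0 and 1, phiver. Even when I predict after converting it to numeric, I get the same error.
Try adding a sample of your data. It is hard to see where the problem is.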

Tags: r prediction r-caret confusion-matrix

【Solution 1】:

max(pred_bstTree) [1] 1.03385
min(pred_bstTree) [1] 1.011738

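The error says it all. Plotting the ROC only examines the effect of different threshold points. Predictions are rounded according to the threshold: with a threshold of 0.5, for example, 0.7 is converted to 1 (the TRUE class) and 0.3 is converted to 0 (the FALSE class). The threshold lies in the range (0, 1).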

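In your case, whatever the threshold, every observation is always assigned to the TRUE class, because even the smallest predicted value is greater than 1. (That is why @phiver wondered whether you are doing regression rather than classification.) With no zeros among the predictions, there is no level in the predictions that matches the 0 level in adverse_effects, hence the error.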
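The mismatch can be reproduced without caret. The following is a minimal sketch with made-up values: `reference` and `pred_raw` are hypothetical stand-ins for testSplit$adverse_effects and pred_bstTree, and it assumes, as the error message suggests, that confusionMatrix compares the levels of its two arguments.

```r
# Toy stand-ins for the objects in the question (values are made up).
reference <- factor(c(0, 0, 1, 0, 1), levels = c(0, 1))
pred_raw  <- c(1.0117, 1.0203, 1.0339, 1.0150, 1.0298)  # numeric scores, all > 1

# Treating the raw scores as class labels yields levels such as "1.0117",
# none of which match the reference levels "0" and "1".
pred_levels <- levels(factor(pred_raw))
overlap     <- intersect(pred_levels, levels(reference))

print(pred_levels)
print(levels(reference))
print(length(overlap))  # 0: no level is shared
```

The predictions therefore need to be turned into class labels with the same levels as the reference before they are tabulated. One thing that may be worth trying, although it is not confirmed anywhere in this thread, is calling predict() on the caret train object itself (model_bstTree) rather than on model_bstTree$finalModel, since predicting from a train object fitted to a factor outcome returns class labels.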

PS: Without the data posted, it is hard to tell the root cause of the error.

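【Discussion】:

abhiieor, the dataset contains nearly 40,000 records, but 88% of the data belongs to class 0 and the rest to class 1.
The data you provided is too little to reproduce the problem. I hope that, to make adverse_effects a factor, you have done either model_bstTree <- train(as.factor(adverse_effects) ~ ., data = trainSplit, method = "bstTree", trControl = ctrl) or attended_num_new_bstTree$adverse_effects <- as.factor(attended_num_new_bstTree$adverse_effects). If so, I suggest trying another classification method, such as logistic regression, random forest, or GBM, to see whether the same behavior appears. Ideally, it should not.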
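As a rough illustration of that suggestion, here is a sketch that checks the outcome is a factor and fits a logistic regression baseline in base R. The data frame `toy` is made up (only the column names Age, DoseB and adverse_effects are borrowed from the question), and the model is evaluated in-sample purely for brevity.

```r
set.seed(1234)
n <- 500

# Made-up data with a roughly 88/12 class split, mirroring the imbalance described above.
toy <- data.frame(
  Age   = round(runif(n, 18, 70)),
  DoseB = runif(n, 0, 5)
)
toy$adverse_effects <- factor(rbinom(n, 1, 0.12), levels = c(0, 1))

# Confirm the outcome is a factor before fitting any classifier.
stopifnot(is.factor(toy$adverse_effects))

# Logistic regression as a simple baseline classifier.
fit  <- glm(adverse_effects ~ Age + DoseB, data = toy, family = binomial)
prob <- predict(fit, newdata = toy, type = "response")  # probabilities of class 1

# Threshold the probabilities and reuse the outcome's levels for the predictions.
pred_class <- factor(ifelse(prob > 0.5, 1, 0), levels = levels(toy$adverse_effects))
print(table(Prediction = pred_class, Reference = toy$adverse_effects))
```

If a baseline like this produces probabilities between 0 and 1 while the bstTree predictions stay above 1, that would point toward the prediction step rather than the data.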

【Solution 2】:

I had a similar problem involving this same error. I used the confusionMatrix function:

confusionMatrix(actual, predicted, cutoff = 0.5)

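I got the following error: Error in confusionMatrix.default(actual, predicted, cutoff = 0.5) : The data must contain some levels that overlap the reference.

I checked a few things, such as:

class(actual) -> numeric

class(predicted) -> integer

unique(actual) -> many values, since it is a probability

unique(predicted) -> 2 levels: 0 and 1

My conclusion was that something was wrong with how the cutoff was being applied, so I did this beforehand: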

predicted<-ifelse(predicted> 0.5,1,0)

Then I ran the confusionMatrix function, and it now works fine:

cm <- confusionMatrix(actual, predicted)
cm$table

This produced the correct result.
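The thresholding step can be sketched in base R as follows. The vectors `prob` and `reference` are made-up, hypothetical inputs; the point is that the thresholded predictions are given exactly the same levels as the reference.

```r
# Hypothetical predicted probabilities and true labels (made-up values).
prob      <- c(0.10, 0.80, 0.35, 0.60, 0.20)
reference <- factor(c(0, 1, 0, 0, 0), levels = c(0, 1))

# Apply the cutoff, then build a factor with the SAME levels as the reference.
# Explicit levels keep both classes present even if every prediction
# falls on one side of the cutoff.
pred_class <- factor(ifelse(prob > 0.5, 1, 0), levels = levels(reference))

# Base-R cross-tabulation of predictions against the reference.
cm <- table(Prediction = pred_class, Reference = reference)
print(cm)
```

With caret loaded, two factors prepared this way could be passed to confusionMatrix in place of the base-R table() call.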

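One point about your case that may improve interpretation once you get the code working: you have swapped the inputs to the confusion matrix (according to the confusion matrix package documentation). Instead of: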

conf_bstTree= confusionMatrix(pred_bstTree,testSplit$adverse_effects)

you should write:

conf_bstTree= confusionMatrix(testSplit$adverse_effects,pred_bstTree)

As mentioned, once you find a way to make it work, this will most likely help you interpret the confusion matrix.
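Argument order matters because swapping the two inputs transposes the table. Which input comes first depends on the package: as far as I can tell, caret's confusionMatrix expects the predictions first (data, reference), while ModelMetrics::confusionMatrix uses (actual, predicted, cutoff = 0.5), so it is worth checking the help page of whichever one is loaded. A small base-R sketch with made-up vectors shows the effect:

```r
# Made-up class labels for illustration.
pred_class <- factor(c(0, 1, 0, 1, 0), levels = c(0, 1))
reference  <- factor(c(0, 1, 0, 0, 0), levels = c(0, 1))

# Rows are the first argument, columns the second.
by_pred <- table(Prediction = pred_class, Reference = reference)
by_ref  <- table(Reference = reference, Prediction = pred_class)

# Swapping the arguments transposes the table, so the off-diagonal cells
# (false positives and false negatives) trade places.
print(by_pred)
print(by_ref)
```

Statistics derived from the table, such as sensitivity and specificity, depend on which margin is treated as the reference, so a swapped call can run without error and still report misleading numbers.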

Hope this helps.

【Discussion】: