Error in ConfusionMatrix: the data and reference factors must have the same number of levels - Answers

【Question title】: Error in ConfusionMatrix the data and reference factors must have the same number of levels
【Posted】: 2014-09-08 04:50:13
【Question】:

I have trained a tree model with the R caret package. I am now trying to generate a confusion matrix and keep getting the following error:

Error in confusionMatrix.default(predictionsTree, testdata$catgeory) : the data and reference factors must have the same number of levels

prob <- 0.5 #Specify class split
singleSplit <- createDataPartition(modellingData2$category, p=prob,
                                   times=1, list=FALSE)
cvControl <- trainControl(method="repeatedcv", number=10, repeats=5)
traindata <- modellingData2[singleSplit,]
testdata <- modellingData2[-singleSplit,]
treeFit <- train(traindata$category~., data=traindata,
                 trControl=cvControl, method="rpart", tuneLength=10)
predictionsTree <- predict(treeFit, testdata)
confusionMatrix(predictionsTree, testdata$catgeory)

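The error occurs when generating the confusion matrix. The levels are the same on both objects, so I cannot figure out what the problem is. Their structure and levels are shown below; they should be identical. Any help would be greatly appreciated, as this is driving me crazy!!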

> str(predictionsTree)
 Factor w/ 30 levels "16-Merchant Service Charge",..: 28 22 22 22 22 6 6 6 6 6 ...
> str(testdata$category)
 Factor w/ 30 levels "16-Merchant Service Charge",..: 30 30 7 7 7 7 7 30 7 7 ...

> levels(predictionsTree)
 [1] "16-Merchant Service Charge"   "17-Unpaid Cheque Fee"         "18-Gov. Stamp Duty"           "Misc"                         "26-Standard Transfer Charge" 
 [6] "29-Bank Giro Credit"          "3-Cheques Debit"              "32-Standing Order - Debit"    "33-Inter Branch Payment"      "34-International"            
[11] "35-Point of Sale"             "39-Direct Debits Received"    "4-Notified Bank Fees"         "40-Cash Lodged"               "42-International Receipts"   
[16] "46-Direct Debits Paid"        "56-Credit Card Receipts"      "57-Inter Branch"              "58-Unpaid Items"              "59-Inter Company Transfers"  
[21] "6-Notified Interest Credited" "61-Domestic"                  "64-Charge Refund"             "66-Inter Company Transfers"   "67-Suppliers"                
[26] "68-Payroll"                   "69-Domestic"                  "73-Credit Card Payments"      "82-CHAPS Fee"                 "Uncategorised"   

> levels(testdata$category)
 [1] "16-Merchant Service Charge"   "17-Unpaid Cheque Fee"         "18-Gov. Stamp Duty"           "Misc"                         "26-Standard Transfer Charge" 
 [6] "29-Bank Giro Credit"          "3-Cheques Debit"              "32-Standing Order - Debit"    "33-Inter Branch Payment"      "34-International"            
[11] "35-Point of Sale"             "39-Direct Debits Received"    "4-Notified Bank Fees"         "40-Cash Lodged"               "42-International Receipts"   
[16] "46-Direct Debits Paid"        "56-Credit Card Receipts"      "57-Inter Branch"              "58-Unpaid Items"              "59-Inter Company Transfers"  
[21] "6-Notified Interest Credited" "61-Domestic"                  "64-Charge Refund"             "66-Inter Company Transfers"   "67-Suppliers"                
[26] "68-Payroll"                   "69-Domestic"                  "73-Credit Card Payments"      "82-CHAPS Fee"                 "Uncategorised"

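【Question comments】:

In your error message, category is spelled catgeory. If that is not the problem, what is the output of identical(levels(predictionsTree), levels(testdata$category))?
Hi, thank you, I fixed the silly typo... doh! I ran the identical() call and it returned [1] TRUE. Now when I run confusionMatrix I get the following error: Error in table(data, reference, dnn = dnn, ...) : all arguments must have the same length
Check for another misspelled catgeory, compare length(testdata$category) with length(predictionsTree), and also look at the summary of both vectors. If you only want a simple confusion matrix: table(predictionsTree, testdata$category)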

Tags: r machine-learning classification r-caret

【Solution 1】:

Try passing a table instead of the two factors:

confusionMatrix(table(predictionsTree, testdata$category))

That worked for me.

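【Discussion】:

【Solution 2】:

Perhaps your model never predicts one of the factor levels. Use the table() function instead of confusionMatrix() to see whether that is the problem.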
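
A minimal sketch of that check, using made-up toy factors rather than the asker's data (the names `reference`, `predictions`, `mismatch_table` and `never_predicted` are illustrative only). It relies on base R: `table()` accepts factors with different level sets, and `setdiff()` lists the classes that never appear in the predictions.

```r
# Toy data (hypothetical): the reference has three classes,
# but the predictions never contain "c".
reference   <- factor(c("a", "b", "c", "a", "b", "c"))
predictions <- factor(c("a", "b", "a", "a", "b", "b"))

# A different number of levels is the condition the error message describes.
print(c(predictions = nlevels(predictions), reference = nlevels(reference)))

# table() does not require matching levels, so the cross-tabulation still
# works; here it has 2 rows (predicted classes) and 3 columns (true classes).
mismatch_table <- table(Prediction = predictions, Reference = reference)
print(mismatch_table)

# Classes present in the reference but absent from the predictions.
never_predicted <- setdiff(levels(reference), levels(predictions))
print(never_predicted)
```

A non-square table, or a non-empty `never_predicted`, suggests the level mismatch is real rather than a typo in a column name.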

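【Discussion】:

You could have added this as a comment.
I found this helpful, but now I wonder: there does not seem to be much difference between the two. Is it just graphical?
If that is the case, how can we fix or work around it gracefully?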

【Solution 3】:

Try specifying na.pass for the na.action option:

predictionsTree <- predict(treeFit, testdata, na.action = na.pass)

【Discussion】:

【Solution 4】:

Put them into data frames, combine those, and then use the result in the confusionMatrix function:

# Coerce the predictions and the true labels to factors
predicted <- factor(predict(treeFit, testdata))
real <- factor(testdata$category)

# Stack both into one data frame; rbind() merges the factor levels
my_data1 <- data.frame(data = predicted, type = "prediction")
my_data2 <- data.frame(data = real, type = "real")
my_data3 <- rbind(my_data1, my_data2)

# Check if the levels are identical
identical(levels(my_data3[my_data3$type == "prediction", 1]), levels(my_data3[my_data3$type == "real", 1]))

confusionMatrix(my_data3[my_data3$type == "prediction", 1], my_data3[my_data3$type == "real", 1], dnn = c("Prediction", "Reference"))

【Discussion】:

【Solution 5】:

There may be missing values in testdata. Remove the incomplete rows before computing predictionsTree:

testdata <- testdata[complete.cases(testdata),]
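
As a rough illustration of what that line does, here is a sketch on a small made-up data frame (the `amount` column and its values are assumptions, not the asker's data). It counts the missing values per column and then keeps only the complete rows, so that the predictions and the reference column are computed from the same rows.

```r
# Hypothetical test set with two incomplete rows.
testdata <- data.frame(
  amount   = c(10.5, NA, 3.2, 8.0, NA),
  category = factor(c("Misc", "Misc", "68-Payroll", "68-Payroll", "Misc"))
)

# How many values are missing in each column?
print(colSums(is.na(testdata)))

# Keep only the rows without any NA.
testdata_complete <- testdata[complete.cases(testdata), ]

# Row counts before and after filtering.
print(c(before = nrow(testdata), after = nrow(testdata_complete)))
```

Predicting on `testdata_complete` and comparing against `testdata_complete$category` keeps both vectors the same length.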

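【Discussion】:

【Solution 6】:

The length problem you are running into is probably caused by NAs in the training set. Either drop the incomplete cases or impute them, so that no missing values remain.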
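
The imputation route could look something like the sketch below. Median imputation is only one simple choice, picked here for illustration; the answer does not prescribe a method, and the `amount` column and its values are made up.

```r
# Hypothetical training set with one missing numeric value.
traindata <- data.frame(
  amount   = c(1, NA, 3, 10),
  category = factor(c("Misc", "Misc", "68-Payroll", "68-Payroll"))
)

# Median of the observed values (NAs are ignored).
median_amount <- median(traindata$amount, na.rm = TRUE)

# Replace the missing entries with that median.
traindata$amount[is.na(traindata$amount)] <- median_amount

print(traindata)
```

Whatever fill value is derived from the training set would normally be reused for the test set as well, so that both are treated consistently.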

【Discussion】:

【Solution 7】:

I had the same problem, and fixed it by dropping the NAs right after reading in the data file, like this:

data = na.omit(data)

Thanks, everyone, for the pointers!

【Discussion】:

【Solution 8】:

Make sure you have installed the package with all of its dependencies:

install.packages('caret', dependencies = TRUE)

confusionMatrix( table(prediction, true_value) )

【Discussion】:

【Solution 9】:

If your data contains NAs, they are sometimes treated as a factor level, so drop those NAs first:

DF = na.omit(DF)

Then, if the fitted model still predicts a set of levels that does not match, it is better to pass a table:

confusionMatrix(table(Arg1, Arg2))

【Discussion】:

【Solution 10】:

I just ran into the same problem, and I solved it by using R's ordered factor data type.

# Use one sorted level set for both factors
# (named lvls so that the base function levels() is not shadowed)
lvls <- sort(levels(predictionsTree))
table(ordered(predictionsTree, lvls), ordered(testdata$category, lvls))
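
The same idea, giving both vectors one explicit level set, can be sketched with plain factors when either side might lack a class. The toy vectors and the name `shared_levels` below are assumptions for illustration; the union of both level sets is used so that no class is dropped.

```r
# Toy data (hypothetical): the predictions never contain "c".
predictions <- factor(c("a", "b", "b", "a"))
reference   <- factor(c("a", "c", "b", "a"))

# One sorted level set covering every class seen on either side.
shared_levels <- sort(union(levels(predictions), levels(reference)))

# Re-declare both factors with the shared levels; the values are unchanged.
predictions_aligned <- factor(predictions, levels = shared_levels)
reference_aligned   <- factor(reference, levels = shared_levels)

# The cross-tabulation is now square, with an all-zero row for "c".
confusion <- table(Prediction = predictions_aligned, Reference = reference_aligned)
print(confusion)
```

With the levels aligned this way, passing `predictions_aligned` and `reference_aligned` to caret's confusionMatrix() should no longer trip the level-count check, though that was not tested against the asker's data.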

【Discussion】: