【Question title】: Classification error: Levels in factors of new data do not match original data
【Posted】: 2018-09-25 17:47:49
【Question description】:

I have this dataset:

"chr","start","stop","strand","num_probes","segment_mean","is_nocnv"
chr18,52502759,52502887,*,2,-2.387,YES
chr18,52508963,68598272,*,9546,-0.3843,YES
chrX,17018571,63154896,*,18479,-0.0448,YES
chrX,63161754,63812965,*,265,-0.5375,YES
chrX,63816350,66632343,*,1071,0.1047,YES
chrX,66632547,67941468,*,558,-0.5452,YES
chrX,67947143,94288567,*,10251,-0.0625,YES
chr1,5902314,10246654,*,2415,-0.1312,NO
chr1,10249962,10255256,*,4,-1.4639,NO
chrX,66632547,67941468,*,605,-0.5472,NO
chrX,67947143,90967744,*,11378,-0.0608,NO
chrX,90968512,90971771,*,9,-0.9191,NO
chrX,90971889,92325108,*,520,-0.088,NO
etc...

I wrote this code:

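library(party)  # assumed source of ctree(); the checkData() error below comes from party

# stringsAsFactors = TRUE keeps string columns as factors (no longer the default in R >= 4.0)
mydata <- read.csv("prova.csv", stringsAsFactors = TRUE)
str(mydata)
set.seed(1234)
# Randomly assign roughly 70% of the rows to training and 30% to testing
ind <- sample(2, nrow(mydata), replace = TRUE, prob = c(0.7, 0.3))
trainData <- mydata[ind == 1, ]
testData <- mydata[ind == 2, ]

myFormula <- is_nocnv ~ chr + start + stop + strand + num_probes + segment_mean
albero <- ctree(myFormula, data = trainData)
# Check the predictions on the training data
table(predict(albero), trainData$is_nocnv)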

Then I have a new test dataset with a single row:

"chr","start","stop","strand","num_probes","segment_mean","is_nocnv"
chr18,52502759,52502887,*,2,-2.387,a

I want to predict the value of is_nocnv for this test dataset ("a" is not a real value, just a placeholder).

To do that, I added the following code:

testData <- read.csv("TEST_DATA.csv", stringsAsFactors = TRUE)
testPred <- predict(albero, newdata = testData)
table(testPred, testData$is_nocnv)

At this point I get an error:

> testPred <- predict(albero,newdata= testData)
 Error in checkData(oldData, RET) : 
  Levels in factors of new data do not match original data

I don't know why.

【Discussion】:

    Tags: r decision-tree


    【Solution 1】:

    The factor variables in your testData have different levels from those in trainData (in your example, chr and is_nocnv).

    Compare levels(testData$is_nocnv) with levels(trainData$is_nocnv), and do the same for $chr.

    The levels need to be identical.
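
    The comparison described above can be run over every factor column at once. This is only a sketch: the two small data frames are made-up stand-ins for trainData and testData (same column names, far fewer rows and levels), and same_levels is a name introduced here for illustration.

```r
# Toy stand-ins for trainData / testData (hypothetical values).
trainData <- data.frame(
  chr      = factor(c("chr1", "chr18", "chrX")),
  is_nocnv = factor(c("NO", "YES", "YES"))
)
testData <- data.frame(
  chr      = factor("chr18"),
  is_nocnv = factor("a")
)

# For every factor column in the training data, check whether the new data
# has exactly the same levels in the same order.
factor_cols <- names(trainData)[sapply(trainData, is.factor)]
same_levels <- sapply(factor_cols, function(col) {
  identical(levels(trainData[[col]]), levels(testData[[col]]))
})
print(same_levels)
```

    Any column reported as FALSE is a candidate for the mismatch that predict() complains about.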

    From this row:

    chr18,52502759,52502887,*,2,-2.387,a
    

    it looks like is_nocnv = a, but your training data only contains the labels YES and NO.

    Make sure you use the same labels and the same levels:

    testData$is_nocnv <- factor("YES", levels = c("NO","YES")) # or "NO"
    

    Or better:

    testData$is_nocnv <- factor("YES", levels = levels(trainData$is_nocnv))
    

    The same applies to the other factor variable, chr:

    testData$chr <- factor("chr18", levels = levels(trainData$chr))
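
    The assignments above hard-code the single value in the one-row test file. A more general variant, sketched here under the assumption that the new data uses the same column names as the training data, re-levels every factor column from the values already present. align_levels is a hypothetical helper, not part of any package, and the data frames are toy stand-ins.

```r
# Toy stand-ins for trainData / testData (hypothetical values).
trainData <- data.frame(
  chr      = factor(c("chr1", "chr18", "chrX")),
  is_nocnv = factor(c("NO", "YES", "YES"))
)
testData <- data.frame(chr = factor("chr18"), is_nocnv = factor("a"))

# Hypothetical helper: rebuild each factor column of `new` with the level
# set of the matching column in `ref`, keeping the existing values.
align_levels <- function(new, ref) {
  for (col in names(ref)) {
    if (is.factor(ref[[col]]) && col %in% names(new)) {
      new[[col]] <- factor(as.character(new[[col]]),
                           levels = levels(ref[[col]]))
    }
  }
  new
}

testData <- align_levels(testData, trainData)
levels(testData$chr)   # now the training levels, value still "chr18"
testData$is_nocnv      # NA, because "a" is not a training level
```

    Values that do not occur among the training levels, such as the placeholder "a", become NA, so the response column may still need a valid placeholder as in the assignments above.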
    

    【Comments】:

    • Thanks for your answer. If I run levels(testData) and levels(trainData) I get NULL. Instead, with this string: testData$is_nocnv
    • I get this: > levels(trainData$is_nocnv) [1] "NO" "YES" and > levels(testData$is_nocnv) [1] "NO" "YES"
    • This is the problem: > levels(trainData$chr) [1] "chr1" "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" "chr17" "chr18" "chr19" "chr2" "chr20" "chr21" "chr22" "chr3" "chr4" "chr5" [19] "chr6" "chr7" "chr8" "chr9" "chrX" "chrY" whereas > levels(testData$chr) [1] "chr18". But I want to classify the new entry.
    • chr stands for chromosome.
    • The original file contains many chr values, while the test file contains a single chr that I want to classify with the rules learned by the classifier. The problem seems to be that the test file has only one chr.