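【Posted】:2020-10-20 21:06:00
【Question】:
I am using random forest and logistic regression models for prediction. Part of my research is to create a "new" data frame that simulates new patients and to predict their probability of death within 30 days after surgery. After running two-fold cross-validation to obtain accuracy ratings, I am now fitting each model on the full dataset: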
#Logistic Regression Model:
fullModelMort = glm(mort30~ahrq_ccs+age+asa_status+bmi+baseline_cancer+baseline_cvd+baseline_dementia+baseline_diabetes+baseline_digestive+baseline_osteoart+baseline_psych+baseline_pulmonary,data=surgery,family="binomial")
#Random Forest Model:
surgery.bag = randomForest(mort30~ahrq_ccs+age+asa_status+bmi+baseline_cancer+baseline_cvd+baseline_dementia+baseline_diabetes+baseline_digestive+baseline_osteoart+baseline_psych+baseline_pulmonary,data=surgery,mtry=2,importance=T,cutoff=c(0.95,0.05))
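For reference, the randomForest documentation describes `cutoff` as a per-class vector: the winning class is the one with the largest ratio of vote proportion to cutoff. The base R sketch below applies that rule to made-up vote shares (the `votes` values are illustrative assumptions, not output from the model above).

```r
# Illustrative vote shares for classes "0" and "1" (made-up numbers, not model output)
votes  <- c("0" = 0.93, "1" = 0.07)
cutoff <- c("0" = 0.95, "1" = 0.05)

# The winning class is the one that maximizes votes / cutoff
ratio  <- votes / cutoff
winner <- names(which.max(ratio))

print(ratio)
print(winner)
```

With two classes and cutoffs that sum to 1, this amounts to labeling a case as class 1 once its vote share exceeds 0.05.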
Next, I create "new patients" to feed into the models and predict the probability of mortality from those inputs:
#New Patients for Predictions
newPatient1=data.frame(ahrq_ccs="Colorectal resection",age=70,asa_status="IV-VI",bmi=27.9,baseline_cancer="Yes",baseline_cvd="Yes",baseline_dementia="No",baseline_diabetes="No",baseline_digestive="No",baseline_osteoart="No",baseline_psych="No",baseline_pulmonary="No")
newPatient2=data.frame(ahrq_ccs="Gastrectomy; partial and total",age=34,asa_status="III",bmi=22.9,baseline_cancer="No",baseline_cvd="Yes",baseline_dementia="No",baseline_diabetes="No",baseline_digestive="No",baseline_osteoart="Yes",baseline_psych="No",baseline_pulmonary="No")
#Predict using LR Model
Patient1 = predict(fullModelMort, newPatient1, type="response")
Patient2 = predict(fullModelMort, newPatient2, type="response")
#Classify whether a patient is High/Low Risk based on probability for mortality:
Determine_Mortality = function(prediction){
if(prediction > .05){
Response=paste("High Risk:", round(prediction*100,2) ,"% Chance of Mortality")
return(Response)
}
else{
Response=paste("Low Risk:", round(prediction*100,2) ,"% Chance of Mortality")
return(Response)
}
}
print(paste0("Patient 1 Results - ", Determine_Mortality(Patient1)))
print(paste0("Patient 2 Results - ", Determine_Mortality(Patient2)))
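`Determine_Mortality` above uses `if`, so it expects a single probability at a time. If several patients were scored in one call, a vectorized variant might look like the sketch below; `classify_risk` and its `threshold` argument are hypothetical names, not part of the original code.

```r
# Hypothetical vectorized variant of Determine_Mortality.
# `threshold` is an assumed parameter; 0.05 mirrors the cut used above.
classify_risk <- function(prediction, threshold = 0.05) {
  # ifelse() evaluates the condition element-wise, unlike if()
  label <- ifelse(prediction > threshold, "High Risk:", "Low Risk:")
  paste(label, round(prediction * 100, 2), "% Chance of Mortality")
}

# Made-up probabilities standing in for predict(..., type = "response") output
risk <- classify_risk(c(0.12, 0.004))
print(risk)
```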
This works for the logistic regression model. However, when I try to do the same for the random forest model, I get the following error:
Error in predict.randomForest(surgery.bag, newPatient1, type = "response") : Type of predictors in new data do not match that of the training data.
Here is my random forest prediction code:
newPatient1=data.frame(ahrq_ccs="Colorectal resection",age=70,asa_status="IV-VI",bmi=27.9,baseline_cancer="Yes",baseline_cvd="Yes",baseline_dementia="No",baseline_diabetes="No",baseline_digestive="No",baseline_osteoart="No",baseline_psych="No",baseline_pulmonary="No")
newPatient2=data.frame(ahrq_ccs="Gastrectomy; partial and total",age=34,asa_status="III",bmi=22.9,baseline_cancer="No",baseline_cvd="Yes",baseline_dementia="No",baseline_diabetes="No",baseline_digestive="No",baseline_osteoart="Yes",baseline_psych="No",baseline_pulmonary="No")
test1=predict(surgery.bag,newPatient1,type="response")
Summary of the dataset used to fit the models (only a subset of these columns is used in the fit):
ahrq_ccs                                   age            gender   race                    asa_status   bmi            baseline_cancer  baseline_cvd  baseline_dementia
Arthroplasty knee                  : 3032  Min.   : 1.00  F:15279  African American: 3416  I-II :15244  Min.   : 2.15  No :18593        No :13947     No :28087
Nephrectomy; partial or complete   : 2559  1st Qu.:48.30  M:13008  Caucasian       :23768  III  :12142  1st Qu.:24.61  Yes: 9694        Yes:14340     Yes:  200
Spinal fusion                      : 2377  Median :58.60           Other           : 1103  IV-VI:  901  Median :28.20
Open prostatectomy                 : 2356  Mean   :57.71                                                Mean   :29.47
Colorectal resection               : 2269  3rd Qu.:68.30                                                3rd Qu.:32.84
Hysterectomy; abdominal and vaginal: 2253  Max.   :90.00                                                Max.   :92.59
(Other)                            :13441
baseline_diabetes  baseline_digestive  baseline_osteoart  baseline_psych  baseline_pulmonary  baseline_charlson  mortality_rsi    complication_rsi  ccsMort30Rate     ccsComplicationRate
No :24582          No :22021           No :23195          No :25639       No :25202           Min.   : 0.000     Min.   :-4.4000  Min.   :-4.7200   Min.   :0.000000  Min.   :0.01612
Yes: 3705          Yes: 6266           Yes: 5092          Yes: 2648       Yes: 3085           1st Qu.: 0.000     1st Qu.:-1.2400  1st Qu.:-0.8600   1st Qu.:0.000789  1st Qu.:0.08198
                                                                                              Median : 0.000     Median :-0.3000  Median :-0.3100   Median :0.002764  Median :0.10937
                                                                                              Mean   : 1.178     Mean   :-0.5385  Mean   :-0.4258   Mean   :0.004328  Mean   :0.13322
                                                                                              3rd Qu.: 2.000     3rd Qu.: 0.0000  3rd Qu.: 0.0000   3rd Qu.:0.007398  3rd Qu.:0.18337
                                                                                              Max.   :13.000     Max.   : 4.8300  Max.   :12.5600   Max.   :0.016673  Max.   :0.46613
hour            dow       month          moonphase           mort30   complication
Min.   : 6.000  Fri:5351  Jun    : 2845  First Quarter:7126  0:28170  0:24542
1st Qu.: 7.000  Mon:6223  Aug    : 2734  Full Moon    :7175  1:  117  1: 3745
Median : 9.000  Thu:4936  Mar    : 2587  Last Quarter :7159
Mean   : 9.854  Tue:6258  Apr    : 2547  New Moon     :6827
3rd Qu.:12.000  Wed:5519  Jan    : 2534
Max.   :18.000            May    : 2524
                          (Other):12516
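I suspect this happens because the factor levels in my test data do not match the levels in the dataset used for fitting, but I cannot be sure.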
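One way to check this suspicion is to compare the class and factor levels of each predictor in the training data and in the new data. The sketch below uses a small made-up `train` frame standing in for `surgery`, and `compare_predictors` is a hypothetical helper, not a function from any package.

```r
# Toy stand-in for the training data (made-up rows; column names mimic `surgery`)
train <- data.frame(
  asa_status      = factor(c("I-II", "III", "IV-VI", "III")),
  baseline_cancer = factor(c("No", "Yes", "No", "No")),
  age             = c(70, 34, 58.6, 48.3)
)

# New patient built the same way as in the question
newPatient <- data.frame(asa_status = "IV-VI", baseline_cancer = "Yes", age = 70)

# Hypothetical helper: per predictor, report the class on each side
# and whether the factor levels are identical
compare_predictors <- function(train, newdata, vars = names(newdata)) {
  data.frame(
    variable    = vars,
    train_class = vapply(vars, function(v) class(train[[v]])[1], character(1)),
    new_class   = vapply(vars, function(v) class(newdata[[v]])[1], character(1)),
    same_levels = vapply(vars, function(v) identical(levels(train[[v]]), levels(newdata[[v]])), logical(1)),
    row.names   = NULL,
    stringsAsFactors = FALSE
  )
}

report <- compare_predictors(train, newPatient)
print(report)
```

Any row where the classes differ or `same_levels` is FALSE points at a predictor that `predict` may reject. Numeric columns have no levels on either side, so they compare as equal.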
The dataset can be downloaded here.
Reproducible code:
library(MASS)
library(ggplot2)
library(Hmisc)
library(corrplot)
library(plyr) #Provides revalue(), used below; load it before dplyr
library(dplyr)
library(randomForest)
library(tidyr)
#Read in the data set
surgery=read.csv("SurgeryTiming.csv")
#Remove dummy values
surgery$gender[surgery$gender == ""] <- NA
surgery$asa_status[surgery$asa_status == ""] <- NA
surgery$race[surgery$race == ""] <- NA
surgery$bmi[surgery$bmi == ""] <- NA
surgery$hour = as.numeric(sub("\\..*", "", as.character(surgery$hour))) #Split out the base hour of surgery
#Drop NA values
surgery = surgery %>% drop_na(gender)
surgery = surgery %>% drop_na(asa_status)
surgery = surgery %>% drop_na(race)
surgery = surgery %>% drop_na(age)
surgery = surgery %>% drop_na(bmi)
#Drop additional levels that now have no values
surgery$gender = droplevels(surgery$gender)
surgery$asa_status = droplevels(surgery$asa_status)
surgery$race = droplevels(surgery$race)
#View our numeric data distributions
num_data <- surgery[,sapply(surgery,is.numeric)]
hist.data.frame(num_data)
surgery$complication=revalue(surgery$complication,c("Yes"=1))
surgery$complication=revalue(surgery$complication,c("No"=0))
surgery$mort30=revalue(surgery$mort30,c("Yes"=1))
surgery$mort30=revalue(surgery$mort30,c("No"=0))
newPatient1=data.frame(ahrq_ccs="Colorectal resection",age=70,asa_status="IV-VI",bmi=27.9,baseline_cancer="Yes",baseline_cvd="Yes",baseline_dementia="No",baseline_diabetes="No",baseline_digestive="No",baseline_osteoart="No",baseline_psych="No",baseline_pulmonary="No")
newPatient2=data.frame(ahrq_ccs="Gastrectomy; partial and total",age=34,asa_status="III",bmi=22.9,baseline_cancer="No",baseline_cvd="Yes",baseline_dementia="No",baseline_diabetes="No",baseline_digestive="No",baseline_osteoart="Yes",baseline_psych="No",baseline_pulmonary="No")
surgery.bag = randomForest(mort30~ahrq_ccs+age+asa_status+bmi+baseline_cancer+baseline_cvd+baseline_dementia+baseline_diabetes+baseline_digestive+baseline_osteoart+baseline_psych+baseline_pulmonary,data=surgery,mtry=2,importance=T,cutoff=c(0.95,0.05)) #The cutoff sets the vote-share threshold for each class; a vote share above 5% for class 1 is classified as 'Mortality' occurring
test1=predict(surgery.bag,newPatient1,type="response")
Any suggestions or advice would be greatly appreciated.
【Comments】:
-
Does this answer your question? Type Mismatch Error using randomForest in R
-
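All names, variable types, and levels must be identical between the data that went into the model and the "newdata" passed to predict. Use str to inspect both data.frames and verify this.
-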
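@JeffreyEvans If my predictors are a subset of the full dataset, do I only need to create levels for those variables? I just updated my code to include all the distinct levels of the variables I use for prediction, but I still get the same error.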
-
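You may have a level in the subset that is not in the data passed to randomForest. That would definitely throw this error. It should not throw an error in the other direction (levels in the model that are not in the new data), but you never know.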
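Following the comments above, one possible way to make the new data match the training data is to rebuild each categorical column as a factor with the training levels, and to stop early if a value was never seen in training. This is a sketch under assumptions: `train` is a made-up stand-in for `surgery`, and `align_levels` is a hypothetical helper.

```r
# Toy stand-in for the training data (made-up rows; column names mimic `surgery`)
train <- data.frame(
  asa_status      = factor(c("I-II", "III", "IV-VI", "III")),
  baseline_cancer = factor(c("No", "Yes", "No", "No")),
  age             = c(70, 34, 58.6, 48.3)
)

newPatient <- data.frame(asa_status = "IV-VI", baseline_cancer = "Yes", age = 70,
                         stringsAsFactors = FALSE)

# Hypothetical helper: give every factor predictor in `newdata`
# exactly the levels it has in `train`
align_levels <- function(newdata, train) {
  for (v in intersect(names(newdata), names(train))) {
    if (is.factor(train[[v]])) {
      values <- as.character(newdata[[v]])
      unseen <- setdiff(values, c(levels(train[[v]]), NA))
      if (length(unseen) > 0) {
        stop("Level(s) not seen in training for '", v, "': ",
             paste(unseen, collapse = ", "))
      }
      newdata[[v]] <- factor(values, levels = levels(train[[v]]))
    }
  }
  newdata
}

aligned <- align_levels(newPatient, train)
str(aligned)
```

With the real data, the call would presumably become `predict(surgery.bag, align_levels(newPatient1, surgery), type = "prob")`, assuming the categorical predictors in `surgery` are factors. In randomForest, `type = "prob"` returns class probabilities rather than the predicted label, which is what a probability-based risk function needs.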
Tags: r machine-learning predict