R Caret 随机森林 AUC 好得令人难以置信？答案

【问题标题】：R Caret Random Forest AUC too good to be true?R Caret 随机森林 AUC 好得令人难以置信？
【发布时间】：2019-02-04 08:23:28
【问题描述】：

预测建模的相对新手——我的大部分培训/经验都在推理统计中。我试图预测学生在 4 年内大学毕业。

基本问题是我已经完成了数据清理（插补、居中、缩放）；将处理/转换的数据分成训练（70%）和测试（30%）集；使用两种方法平衡数据（因为数据是 65%=0, 35%=1--我发现关于什么归类为不平衡的建议不一致，但一个消息来源建议任何不在 40/60 范围内的东西）--ROSE “两者”和 SMOTE；并运行随机森林。

对于 ROSE“BOTH”模型，我在训练集上的准确度为 0.9242，在测试集上的 AUC 为 0.9268。

对于 SMOTE 模型，我在训练集上的准确度为 0.9943，在测试集上的 AUC 为 0.9971。

关于模型性能的更多细节嵌入在下面复制的代码中。

这似乎好得令人难以置信。但是，根据我在测试集上发现的稍微改进的性能并不表示过度拟合（反之亦然）。那么，这个模型的性能可能真的很好，还是好得令人难以置信？我无法通过 SO 搜索找到这个问题的直接答案。

另外，几周后我将获得另一组数据，我可以在此基础上运行它。我想这可能是另一个“测试”集，对吗？然后我可以将此应用于我们有兴趣了解其 4 年后毕业可能性的最新队列。

非常感谢，布赖恩

#Used for predictive modeling of 4-year graduation

#IMPORT DATA
library(haven)
grad4yr <- [file path]

#DETERMINE DATA BALANCE/UNBALANCE
prop.table(table(grad4yr$graduate_4_yrs))
# 0=0.6492, 1=0.3517


#convert  to factor so next step doesn't impute outcome variable
grad4yr$graduate_4_yrs <- as.factor(grad4yr$graduate_4_yrs)

#Preprocess data, RANN package used
library('RANN')

#Create proprocessed values object which includes centering, scaling, and imputing missing values using KNN
Processed_Values <- preProcess(grad4yr, method = c("knnImpute","center","scale"))

#Create new dataset with imputed values and centering/scaling
    #Confirmed this results in 0 cases with missing values
grad4yr_data_processed <- predict(Processed_Values, grad4yr)

#Confirm last step results in 0 cases with missing values
sum(is.na(grad4yr_data_processed))
#[1] 0

#Convert outcome variable to numeric to ensure dummify step (next) doesn't dummify outcome variable.
grad4yr_data_processed$graduate_4_yrs <- as.factor(grad4yr_data_processed$graduate_4_yrs)

#Convert all factor variables to dummy variables; fullrank used to omit one of new dummy vars in each
#set.
dmy <- dummyVars("~ .", data = grad4yr_data_processed, fullRank = TRUE)

#Create new dataset that has the data imputed AND transformed to have dummy variables for all variables that
#will go in models.
grad4yr_processed_transformed <- data.frame(predict(dmy,newdata = grad4yr_data_processed))

#Convert outcome variable back to binary/factor for predictive models and create back variable with same name
  #not entirely sure who last step created new version of outcome var with ".1" at the end
grad4yr_processed_transformed$graduate_4_yrs.1 <- as.factor(grad4yr_processed_transformed$graduate_4_yrs.1)
grad4yr_processed_transformed$graduate_4_yrs <- as.factor(grad4yr_processed_transformed$graduate_4_yrs)
grad4yr_processed_transformed$graduate_4_yrs.1 <- NULL

#Split data into training and testing/validation datasets based on outcome at 70%/30%
index <- createDataPartition(grad4yr_processed_transformed$graduate_4_yrs, p=0.70, list=FALSE)
trainSet <- grad4yr_processed_transformed[index,]
testSet <- grad4yr_processed_transformed[-index,]


#load caret
library(caret)

#Feature selection using rfe in R Caret, used with profile/comparison
control <- rfeControl(functions = rfFuncs,
                      method = "repeatedcv",
                      repeats = 10,#using k=10 per Kuhn & Johnson pp70; and per James et al pp 
                            #https://www-bcf.usc.edu/~gareth/ISL/ISLR%20First%20Printing.pdf
                      verbose = FALSE)


#create traincontrol using repeated cross-validation with 10 fold 5 times
fitControl <- trainControl(method = "repeatedcv",
                           number = 10,
                           repeats = 5, 
                           search = "random")



#Set the outcome variable object
grad4yrs <- 'graduate_4_yrs'

#set predictor variables object
predictors <- names(trainSet[!names(trainSet) %in% grad4yrs])

#create predictor profile to see what where prediction is best (by num vars)
grad4yr_pred_profile <- rfe(trainSet[,predictors],trainSet[,grad4yrs],rfeControl = control)

# Recursive feature selection
# 
# Outer resampling method: Cross-Validated (10 fold, repeated 5 times) 
# 
# Resampling performance over subset size:
#   
#   Variables Accuracy  Kappa AccuracySD KappaSD Selected
# 4   0.6877 0.2875    0.03605 0.08618         
# 8   0.7057 0.3078    0.03461 0.08465        *
# 16  0.7006 0.2993    0.03286 0.08036         
# 40  0.6949 0.2710    0.03330 0.08157         
# 
# The top 5 variables (out of 8):
#   Transfer_Credits, HS_RANK, Admit_Term_Credits_Taken, first_enroll, Admit_ReasonUT10


#see data structure
str(trainSet)
#not copying output here, but confirms outcome var is factor and everything else is numeric

#given 65/35 split on outcome var and what can find about unbalanced data, considering unbalanced and doing steps to balance.
#using ROSE "BOTH and SMOTE to see how differently they perform. Also ran under/over with ROSE but they didn't perform nearly as
#well so removed from this script.
#SMOTE to balance data on the processed/dummified dataset
library(DMwR)#https://www3.nd.edu/~dial/publications/chawla2005data.pdf for justification
train.SMOTE <- SMOTE(graduate_4_yrs ~ ., data=grad4yr_processed_transformed, perc.over=600, perc.under=100)

#see how balanced SMOTE resulting dataset is
prop.table(table(train.SMOTE$graduate_4_yrs))
#0         1 
#0.4615385 0.5384615 


#open ROSE package/library
library("ROSE")

#ROSE to balance data (using BOTH) on the processed/dummified dataset
train.both <- ovun.sample(graduate_4_yrs ~ ., data=grad4yr_processed_transformed, method = "both", p=.5, 
                               N = 2346)$data
#see how balanced BOTH resulting dataset is
prop.table(table(train.both$graduate_4_yrs))
#0         1 
#0.4987212 0.5012788

#ROSE to balance data (using BOTH) on the processed/dummified dataset
table(grad4yr_processed_transformed$graduate_4_yrs)
#0    1 
#1144  618 



library("caret")
#create random forests using balanced data from above
RF_model_both <- train(train.both[,predictors],train.both[, grad4yrs],method = 'rf', trControl = fitControl, ntree=1000, tuneLength = 10)

#print info on accuracy & kappa for "BOTH" training model
# print(RF_model_both)
# Random Forest 
# 
# 2346 samples
# 40 predictor
# 2 classes: '0', '1' 
# 
# No pre-processing
# Resampling: Cross-Validated (10 fold, repeated 5 times) 
# Summary of sample sizes: 2112, 2111, 2111, 2112, 2111, 2112, ... 
# Resampling results across tuning parameters:
#   
#   mtry  Accuracy   Kappa    
# 8    0.9055406  0.8110631
# 11    0.9053719  0.8107246
# 12    0.9057981  0.8115770
# 13    0.9054584  0.8108965
# 14    0.9048602  0.8097018
# 20    0.9034992  0.8069796
# 26    0.9027307  0.8054427
# 30    0.9034152  0.8068113
# 38    0.9023899  0.8047622
# 40    0.9032428  0.8064672

# Accuracy was used to select the optimal model using the largest value.
# The final value used for the model was mtry = 12.


RF_model_SMOTE <- train(train.SMOTE[,predictors],train.SMOTE[, grad4yrs],method = 'rf', trControl = fitControl, ntree=1000, tuneLength = 10)
#print info on accuracy & kappa for "SMOTE" training model
# print(RF_model_SMOTE)
# Random Forest 
# 
# 8034 samples
# 40 predictor
# 2 classes: '0', '1' 
# 
# No pre-processing
# Resampling: Cross-Validated (10 fold, repeated 5 times) 
# Summary of sample sizes: 7231, 7231, 7230, 7230, 7231, 7231, ... 
# Resampling results across tuning parameters:
#   
#   mtry  Accuracy   Kappa    
# 17    0.9449082  0.8899939
# 19    0.9458047  0.8917740
# 21    0.9458543  0.8918695
# 29    0.9470243  0.8941794
# 31    0.9468750  0.8938864
# 35    0.9468003  0.8937290
# 36    0.9463772  0.8928876
# 40    0.9463275  0.8927828
# 
# Accuracy was used to select the optimal model using the largest value.
# The final value used for the model was mtry = 29.

#Given that both accuracy and kappa appear better in the "SMOTE" random forest it's looking like it's the better model.
#But, running ROC/AUC on both to see how they both perform on validation data.


#Create predictions based on random forests above
rf_both_predictions <- predict.train(object=RF_model_both,testSet[, predictors], type ="raw")
rf_SMOTE_predictions <- predict.train(object=RF_model_SMOTE,testSet[, predictors], type ="raw")

#Create predictions based on random forests above
rf_both_pred_prob <- predict.train(object=RF_model_both,testSet[, predictors], type ="prob")
rf_SMOTE_pred_prob <- predict.train(object=RF_model_SMOTE,testSet[, predictors], type ="prob")



#create Random Forest confusion matrix to evaluate random forests
confusionMatrix(rf_both_predictions,testSet[,grad4yrs], positive = "1")
#output copied here:
# Confusion Matrix and Statistics
# 
# Reference
# Prediction   0   1
# 0 315  12
# 1  28 173
# 
# Accuracy : 0.9242          
# 95% CI : (0.8983, 0.9453)
# No Information Rate : 0.6496          
# P-Value [Acc > NIR] : < 2e-16         
# 
# Kappa : 0.8368          
# Mcnemar's Test P-Value : 0.01771         
#                                           
#             Sensitivity : 0.9351          
#             Specificity : 0.9184          
#          Pos Pred Value : 0.8607          
#          Neg Pred Value : 0.9633          
#              Prevalence : 0.3504          
#          Detection Rate : 0.3277          
#    Detection Prevalence : 0.3807          
#       Balanced Accuracy : 0.9268          
#                                           
#        'Positive' Class : 1 

# confusionMatrix(rf_under_predictions,testSet[,grad4yrs], positive = "1")
#output copied here:
#Accuracy : 0.8258
  #only copied accuracy as it was fair below two other versions
confusionMatrix(rf_SMOTE_predictions,testSet[,grad4yrs], positive = "1")
#output copied here:
# Confusion Matrix and Statistics
# 
# Reference
# Prediction   0   1
# 0 340   0
# 1   3 185
# 
# Accuracy : 0.9943          
# 95% CI : (0.9835, 0.9988)
# No Information Rate : 0.6496          
# P-Value [Acc > NIR] : <2e-16          
# 
# Kappa : 0.9876          
# Mcnemar's Test P-Value : 0.2482          
# 
# Sensitivity : 1.0000          
# Specificity : 0.9913          
# Pos Pred Value : 0.9840          
# Neg Pred Value : 1.0000          
# Prevalence : 0.3504          
# Detection Rate : 0.3504          
# Detection Prevalence : 0.3561          
# Balanced Accuracy : 0.9956          
# 
# 'Positive' Class : 1 


#put predictions in dataset
testSet$rf_both_pred <- rf_both_predictions#predictions (BOTH)
testSet$rf_SMOTE_pred <- rf_SMOTE_predictions#probabilities (BOTH)
testSet$rf_both_prob <- rf_both_pred_prob#predictions (SMOTE)
testSet$rf_SMOTE_prob <- rf_SMOTE_pred_prob#probabilities (SMOTE)



library(pROC)
#get AUC of the BOTH predictions
testSet$rf_both_pred <- as.numeric(testSet$rf_both_pred)
Both_ROC_Curve <- roc(response = testSet$graduate_4_yrs,
                      predictor = testSet$rf_both_pred,
                      levels = rev(levels(testSet$graduate_4_yrs)))
auc(Both_ROC_Curve)
# Area under the curve: 0.9268

#get AUC of the SMOTE predictions
testSet$rf_SMOTE_pred <- as.numeric(testSet$rf_SMOTE_pred)
SMOTE_ROC_Curve <- roc(response = testSet$graduate_4_yrs,
                      predictor = testSet$rf_SMOTE_pred,
                      levels = rev(levels(testSet$graduate_4_yrs)))

auc(SMOTE_ROC_Curve)
#Area under the curve: 0.9971


#So, the SMOTE balanced data performed very well on training data and near perfect on the validation/test data.
#But, it seems almost too good to be true. 
#Is there anything I might have missed or performed incorrectly?

【问题讨论】：

这不太可能在这里得到一个好的答案；这里有很多代码，没有数据，所以不太可能有人会给你一个答案来显示你的代码有问题。有关预测建模的更一般性问题，您最好关注stats.stackexchange.com，或者datascience.stackexchange.com。
你确定你只在火车上使用 smote 或 rose 吗？你应该只平衡火车。你有data=grad4yr_processed_transformed，但在SMOTE 调用中应该是data=trainSet。
另外，是的，新队列将是检查您的模型预测的完美测试集。
我真的认为你通过平衡整个数据集污染了你的训练集，这样模型已经看到了测试数据，看起来你过拟合了。
LRave。你说得对。我现在正在编辑。感谢您的帮助。

标签： r random-forest r-caret auc

【解决方案1】：

我会发布我的评论作为答案，即使这可能会被迁移。

我真的认为你过度拟合了，因为你已经平衡了整个数据集。相反，您应该只平衡 train set。

这是您的代码：

library(DMwR)
train.SMOTE <- SMOTE(graduate_4_yrs ~ ., data=grad4yr_processed_transformed,
perc.over=600, perc.under=100)

通过这样做，您的 train.SMOTE 现在也包含来自测试集的信息，因此当您在您的 testSet 上进行测试时，模型将已经看到部分数据，这很可能是您的“太好”的结果。

应该是：

library(DMwR)
train.SMOTE <- SMOTE(graduate_4_yrs ~ ., data=trainSet, # use only the train set
perc.over=600, perc.under=100)

【讨论】：

不仅如此，还对所有数据执行了preProcess(grad4yr, method = c("knnImpute","center","scale"))。我也同意 65/35 的数据集不应该是平衡的。