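【Posted】: 2017-06-16 07:12:22
【Question】:
I am trying to learn how to build predictive models. I recently came across the xgboost package in R and tried applying it to the Titanic dataset. I have built a model, and now I would like to know how to detect whether it is overfitting, how to choose the number of rounds, and whether that choice should be based on the training error or the test error.
Here is the code: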
library(caret)    # provides dummyVars()
library(xgboost)  # provides xgb.DMatrix() and xgb.cv()
#Load Dataset
titanic.train <- read.csv("D:/Data/titanic/train.csv")
titanic.test <- read.csv("D:/Data/titanic/test.csv")
PassengerId=titanic.test$PassengerId
head(titanic.train)
#Create columns to distinguish between Train and Test datasets
titanic.train$IsTrainSet <- TRUE
titanic.test$IsTrainSet <- FALSE
#Create a missing column for Test data
titanic.test$Survived <- NA
#Combine Test and Train Datasets
titanic.full <- rbind(titanic.train , titanic.test)
tail(titanic.full)
titanic.full$Name <- as.character(titanic.full$Name)
titanic.full$Title <- sapply(titanic.full$Name, FUN=function(x) {strsplit(x, split='[,.]')[[1]][2]})
titanic.full$Title <- sub(' ','',titanic.full$Title)
titanic.full$Title[titanic.full$Title %in% c('Capt', 'Col' , 'Dr' , 'Don', 'Major', 'Sir' , 'Rev' ,
'Dona', 'Lady', 'the Countess' , 'Jonkheer', 'Master')] <- 'Noble'
titanic.full$Title[titanic.full$Title %in% c('Ms', 'Miss' , 'Mlle')] <- 'Miss'
titanic.full$Title[titanic.full$Title %in% c('Mrs' , 'Mme')] <- 'Mrs'
table(titanic.full$Title)
#Family size = siblings/spouses + parents/children + the passenger (the >= 3 flag below is commented out)
titanic.full$Family <- titanic.full$SibSp + titanic.full$Parch + 1
table(titanic.full$Family)
#titanic.full$Family <- titanic.full$Family >= 3
#titanic.full$Family <- as.factor(titanic.full$Family)
#levels(titanic.full$Family) <- c(0,1)
#titanic.full$Family
titanic.full <- titanic.full[c( "Pclass" , "Title" , "Sex" , "Age" , "Family" , "Fare", "SibSp" , "Parch" , "Embarked" , "Survived")]
head(titanic.full)
#Categorical Casting
titanic.full$Title <- as.factor(titanic.full$Title)
titanic.full$Sex <- as.factor(titanic.full$Sex)
titanic.full$Embarked <- as.factor(titanic.full$Embarked)
titanicDummy <- dummyVars("~.",data=titanic.full, fullRank=T)
titanic.full <- as.data.frame(predict(titanicDummy,titanic.full))
print(names(titanic.full))
#Create test and train data sets
titanic.train <- titanic.full[1:891,]
titanic.test <- titanic.full[892:1309,]
#XGBoosting
set.seed(35)
labs <- titanic.train$Survived
names(titanic.full)
dat <- titanic.train[c("Pclass","Title.Mr","Title.Mrs","Title.Noble", "Sex.male","Age", "Family", "Fare", "SibSp","Parch","Embarked.C","Embarked.Q","Embarked.S")]
titdata <- xgb.DMatrix(data = as.matrix(dat), missing = NA, label=as.numeric(labs))
#titdata is already an xgb.DMatrix that carries the labels, so no separate label argument is needed
res <- xgb.cv(objective = "binary:logistic", eta = 0.1, metrics = "auc", max_depth = 3,
data = titdata, nrounds = 200, nfold = 10, prediction = TRUE)
Here is the result. I need help interpreting it, along with some advice on whether to increase or decrease "eta" and "max_depth".
res
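For reference, here is a minimal sketch of one way the cross-validation log could be read. It assumes that `res$evaluation_log` has the columns `iter`, `train_auc_mean`, and `test_auc_mean` (the names recent xgboost versions use when the metric is AUC). The `eval_log` data frame and its numbers are made-up stand-ins so the snippet runs on its own; they are not results from this model.

```r
# Hypothetical stand-in for res$evaluation_log. The values are invented
# purely for illustration; with a real run, use eval_log <- res$evaluation_log.
eval_log <- data.frame(
  iter           = 1:8,
  train_auc_mean = c(0.860, 0.880, 0.900, 0.920, 0.940, 0.960, 0.975, 0.990),
  test_auc_mean  = c(0.850, 0.860, 0.870, 0.875, 0.872, 0.868, 0.860, 0.850)
)

# Pick the number of rounds from the held-out (test) folds, not from the
# training folds: training AUC keeps rising even when the model overfits.
best_iter <- eval_log$iter[which.max(eval_log$test_auc_mean)]

# Gap between training and test AUC at each round. A gap that keeps widening
# while the test AUC stalls or falls is the usual sign of overfitting.
eval_log$gap <- eval_log$train_auc_mean - eval_log$test_auc_mean

print(best_iter)
print(eval_log[eval_log$iter == best_iter, ])
```

In a real run, the same idea can be automated by passing `early_stopping_rounds` to `xgb.cv`, which stops once the test metric has not improved for that many rounds. As a general rule of thumb, a smaller `eta` tends to need more rounds, and a smaller `max_depth` gives simpler trees that are less prone to overfitting.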
【Discussion】:
Tags: r machine-learning data-science xgboost auc