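【Posted】: 2021-06-16 04:01:13
【Question】:
I would like to repeat the hyperparameter tuning (alpha and/or lambda) of glmnet in mlr3 to avoid variability in smaller data sets.
In caret, I can do this with "repeatedcv".
Since I really like the mlr3 family of packages, I would like to use them for my analysis. However, I am not sure about the correct way to do this step in mlr3.
Example data
# libraries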
library(caret)
library(mlr3verse)
library(mlbench)
# get example data
data(PimaIndiansDiabetes, package="mlbench")
data <- PimaIndiansDiabetes
# get small training data
train.data <- data[1:60,]
Created on 2021-03-18 by the reprex package (v1.0.0)
caret approach (tuning alpha and lambda) using "cv" and "repeatedcv"
trControlCv <- trainControl("cv",
                            number = 5,
                            classProbs = TRUE,
                            savePredictions = TRUE,
                            summaryFunction = twoClassSummary)
# use "repeatedcv" to avoid variability in smaller data sets
trControlRCv <- trainControl("repeatedcv",
                             number = 5,
                             repeats = 20,
                             classProbs = TRUE,
                             savePredictions = TRUE,
                             summaryFunction = twoClassSummary)
# train and extract coefficients with "cv" and different set.seed
set.seed(2323)
model <- train(
  diabetes ~ ., data = train.data, method = "glmnet",
  trControl = trControlCv,
  tuneLength = 10,
  metric = "ROC"
)
coef(model$finalModel, model$finalModel$lambdaOpt) -> coef1
set.seed(23)
model <- train(
  diabetes ~ ., data = train.data, method = "glmnet",
  trControl = trControlCv,
  tuneLength = 10,
  metric = "ROC"
)
coef(model$finalModel, model$finalModel$lambdaOpt) -> coef2
# train and extract coefficients with "repeatedcv" and different set.seed
set.seed(13)
model <- train(
  diabetes ~ ., data = train.data, method = "glmnet",
  trControl = trControlRCv,
  tuneLength = 10,
  metric = "ROC"
)
coef(model$finalModel, model$finalModel$lambdaOpt) -> coef3
set.seed(55)
model <- train(
  diabetes ~ ., data = train.data, method = "glmnet",
  trControl = trControlRCv,
  tuneLength = 10,
  metric = "ROC"
)
coef(model$finalModel, model$finalModel$lambdaOpt) -> coef4
Different coefficients with cross-validation, identical coefficients with repeated cross-validation
# with "cv" I get different coefficients
identical(coef1, coef2)
#> [1] FALSE
# with "repeatedcv" I get the same coefficients
identical(coef3,coef4)
#> [1] TRUE
First mlr3 approach using cv.glmnet (tunes lambda internally)
# create elastic net regression
glmnet_lrn = lrn("classif.cv_glmnet", predict_type = "prob")
# define train task
train.task <- TaskClassif$new("train.data", train.data, target = "diabetes")
# create learner
learner = as_learner(glmnet_lrn)
# train the learner with different set.seed
set.seed(2323)
learner$train(train.task)
coef(learner$model, s = "lambda.min") -> coef1
set.seed(23)
learner$train(train.task)
coef(learner$model, s = "lambda.min") -> coef2
Different coefficients with cross-validation
# compare coefficients
coef1
#> 9 x 1 sparse Matrix of class "dgCMatrix"
#>                        1
#> (Intercept) -3.323460895
#> age          0.005065928
#> glucose      0.019727881
#> insulin      .
#> mass         .
#> pedigree     .
#> pregnant     0.001290570
#> pressure     .
#> triceps      0.020529162
coef2
#> 9 x 1 sparse Matrix of class "dgCMatrix"
#>                        1
#> (Intercept) -3.146190752
#> age          0.003840963
#> glucose      0.019015433
#> insulin      .
#> mass         .
#> pedigree     .
#> pregnant     .
#> pressure     .
#> triceps      0.018841557
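To put a number on how far apart the two cv.glmnet fits are, the printed coefficients can be compared with a tolerance instead of with identical(). The sketch below is base R only and retypes the values from the output above as plain named vectors, treating the "." entries as exact zeros. The helper names (vars, b1, b2, max_abs_diff, only_in_one) are made up for illustration.

```r
# Coefficients copied from the printed output above; "." entries of the
# sparse matrix are structural zeros.
vars <- c("(Intercept)", "age", "glucose", "insulin", "mass",
          "pedigree", "pregnant", "pressure", "triceps")
b1 <- setNames(c(-3.323460895, 0.005065928, 0.019727881, 0, 0, 0,
                 0.001290570, 0, 0.020529162), vars)
b2 <- setNames(c(-3.146190752, 0.003840963, 0.019015433, 0, 0, 0,
                 0, 0, 0.018841557), vars)

# Exact comparison, as done with identical() for the caret fits
same_exact <- identical(b1, b2)

# Tolerance-based comparison: TRUE only if the vectors agree up to 1e-6
same_approx <- isTRUE(all.equal(b1, b2, tolerance = 1e-6))

# Largest absolute difference between the two coefficient vectors
max_abs_diff <- max(abs(b1 - b2))

# Variables that are non-zero (selected) in only one of the two fits
only_in_one <- names(b1)[(b1 != 0) != (b2 != 0)]

print(c(exact = same_exact, approx = same_approx))
print(max_abs_diff)
print(only_in_one)
```

With real fits, a sparse coefficient matrix such as coef1 can be turned into a named vector of this shape with as.matrix(coef1)[, 1] before comparing.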
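Update 1: progress so far
Based on the comments below and this comment, I can use rsmp and AutoTuner.
This answer suggests tuning glmnet rather than cv.glmnet (glmnet was not available in mlr3 at the time).
Second mlr3 approach using glmnet (repeated tuning of alpha and lambda)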
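As a rough orientation, the two caret resampling schemes used earlier have counterparts among mlr3's resampling objects. The sketch below assumes 5 folds and 20 repeats to mirror the trainControl() settings. It passes them explicitly because, as far as I know, rsmp("repeated_cv") called without arguments falls back to its own defaults (10 folds, 10 repeats) rather than these values. The names cv5 and rcv5x20 are made up for illustration.

```r
library(mlr3)

# Counterpart of trainControl("cv", number = 5)
cv5 <- rsmp("cv", folds = 5)

# Counterpart of trainControl("repeatedcv", number = 5, repeats = 20)
rcv5x20 <- rsmp("repeated_cv", folds = 5, repeats = 20)

# Number of train/test splits each scheme will generate
print(cv5$iters)
print(rcv5x20$iters)
```

Each hyperparameter configuration is then scored on all splits of the chosen resampling, so the repeated variant averages over 100 splits instead of 5.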
# define train task
train.task <- TaskClassif$new("train.data", train.data, target = "diabetes")
# create elastic net regression
glmnet_lrn = lrn("classif.glmnet", predict_type = "prob")
# turn to learner
learner = as_learner(glmnet_lrn)
# make search space
# note: s (lambda) has lower == upper, so it is fixed at 1 and only alpha varies
search_space = ps(
  alpha = p_dbl(lower = 0, upper = 1),
  s = p_dbl(lower = 1, upper = 1)
)
# set terminator
terminator = trm("evals", n_evals = 20)
# set tuner
tuner = tnr("grid_search", resolution = 3)
# tune the learner
at = AutoTuner$new(
  learner = learner,
  resampling = rsmp("repeated_cv"),
  measure = msr("classif.ce"),
  search_space = search_space,
  terminator = terminator,
  tuner = tuner)
at
#> <AutoTuner:classif.glmnet.tuned>
#> * Model: -
#> * Parameters: list()
#> * Packages: glmnet
#> * Predict Type: prob
#> * Feature types: logical, integer, numeric
#> * Properties: multiclass, twoclass, weights
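Open questions
How can I show that my second approach is valid and that I get the same or similar coefficients with different seeds? I.e., how do I extract the coefficients of the AutoTuner's final model?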
set.seed(23)
at$train(train.task) -> tune1
set.seed(2323)
at$train(train.task) -> tune2
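One possible way to get at the coefficients, sketched under several assumptions: after training, at$learner is the learner refitted with the best configuration, at$learner$model is the underlying glmnet fit, and at$tuning_result holds the selected alpha and s. Note also that at$train() returns the AutoTuner itself, so tune1 and tune2 above point to the same object, and the sketch therefore trains a deep clone per seed. The s range (0.001 to 0.1), the 5 x 20 resampling, and the helper fit_once() are assumptions for illustration, not taken from the question.

```r
library(mlr3verse)
library(mlbench)
library(Matrix)

# Silence per-iteration log output
lgr::get_logger("mlr3")$set_threshold("warn")
lgr::get_logger("bbotk")$set_threshold("warn")

data(PimaIndiansDiabetes, package = "mlbench")
train.data <- PimaIndiansDiabetes[1:60, ]
train.task <- TaskClassif$new("train.data", train.data, target = "diabetes")

at <- AutoTuner$new(
  learner      = lrn("classif.glmnet", predict_type = "prob"),
  resampling   = rsmp("repeated_cv", folds = 5, repeats = 20),
  measure      = msr("classif.ce"),
  search_space = ps(
    alpha = p_dbl(lower = 0, upper = 1),
    s     = p_dbl(lower = 0.001, upper = 0.1)  # ASSUMED range, not from the question
  ),
  terminator   = trm("evals", n_evals = 20),
  tuner        = tnr("grid_search", resolution = 3)
)

# Hypothetical helper: tune on a fresh copy and return the selected
# hyperparameters together with the coefficients at the selected s.
fit_once <- function(seed) {
  set.seed(seed)
  at_i <- at$clone(deep = TRUE)   # avoid overwriting a previously trained tuner
  at_i$train(train.task)
  best <- at_i$tuning_result
  list(
    alpha = best$alpha,
    s     = best$s,
    coefs = as.matrix(coef(at_i$learner$model, s = best$s))[, 1]
  )
}

run1 <- fit_once(23)
run2 <- fit_once(2323)

# Selected hyperparameters per seed
print(c(alpha = run1$alpha, s = run1$s))
print(c(alpha = run2$alpha, s = run2$s))

# Largest absolute coefficient difference between the two seeds
print(max(abs(run1$coefs - run2$coefs)))
```

If both seeds select the same alpha and s, the final glmnet fits should agree closely. If they select different grid points, the coefficients will differ even with repeated cross-validation, which suggests the resampling is still too noisy to separate those configurations.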
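【Comments】:
-
You can do the same in mlr3, see mlr3book.mlr-org.com/resampling.html
-
@LarsKotthoff Thanks for your comment. I adjusted my question accordingly.
-
I'm not sure what your question is or whether it has already been answered. Please try to ask short, concise questions in the future (the reprex is great, though!). You can also answer your own question and, if in doubt, ask a new one. To answer your question: I have already answered how to tune glmnet with the old mlr [here] stackoverflow.com/questions/50995525/… Porting that to mlr3 should not be too hard, but I don't have time for it right now. Does this help?
-
Thanks for the helpful comment. I have tried to state more concisely the progress I made (thanks to the comments) and the points of the question that are still open.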