【发布时间】:2021-06-20 08:30:06
【问题描述】:
我想尝试使用 tidymodels 和 treesnip 包的 LightGBM 算法。 一些预处理...
# remotes::install_github("curso-r/treesnip")
# install.packages("titanic")
library(tidymodels)
library(stringr)
library(titanic)
data("titanic_train")
df <- titanic_train %>% as_tibble %>%
mutate(title=str_extract(Name,"\\w+\\.") %>% str_replace(fixed("."),"")) %>%
mutate(title=case_when(title %in% c('Mlle','Ms')~'Miss',
title=='Mme'~ 'Mrs',
title %in% c('Capt','Don','Major','Sir','Jonkheer', 'Col')~'Sir',
title %in% c('Dona', 'Lady', 'Countess')~'Lady',
TRUE~title)) %>%
mutate(title=as.factor(title),
Survived=factor(Survived,levels = c(0,1),labels=c("no","yes")),
Sex=as.factor(Sex),
Pclass=factor(Pclass)) %>%
select(-c(PassengerId,Ticket,Cabin,Name)) %>%
mutate(Embarked=as.factor(Embarked))
table(df$title,df$Sex)
trnTst <- initial_split(data = df,prop = .8,strata = Survived)
cv.folds <- training(trnTst) %>%
vfold_cv(data = .,v = 4,repeats = 1)
cv.folds
rec <- recipe(Survived~.,data = training(trnTst)) %>%
step_nzv(all_predictors()) %>%
step_knnimpute(Age,neighbors = 3,impute_with = vars(title,Fare,Pclass))
为了检查问题不在数据中,我成功调整了随机森林算法。
m.rf <- rand_forest(trees = 1000,min_n = tune(),mtry = tune()) %>%
set_mode(mode = 'classification') %>%
set_engine('ranger')
wf.rf <- workflow() %>% add_recipe(rec) %>% add_model(m.rf)
(cls <- parallel::makeCluster(parallel::detectCores()-1))
doParallel::registerDoParallel(cl = cls)
tn.rf <- tune_grid(wf.rf,resamples = cv.folds,grid = 20,
metrics = metric_set(accuracy,roc_auc))
doParallel::stopImplicitCluster()
autoplot(tn.rf)
wf.rf <- finalize_workflow(x = wf.rf,parameters = select_best(tn.rf,metric = 'roc_auc'))
res.rf <- fit_resamples(wf.rf,resamples = cv.folds,metrics = metric_set(accuracy,roc_auc))
res.rf %>% collect_metrics()
但 lightGBM 在没有调整和并行处理的情况下会引发错误
根据How to Use Lightgbm with Tidymodels
与 XGBoost 相比,lightgbm 和 catboost 都非常有能力处理分类变量(因子),因此您不需要将变量转换为虚拟变量(一种热编码),实际上您不应该这样做,它让一切变慢,并可能给你带来更差的性能。
library(treesnip) # lightgbm & catboost connector
m.lgbm <- boost_tree() %>% #trees = tune(), min_n = tune()) %>%
set_mode(mode = 'classification') %>%
set_engine('lightgbm')
wf.lgbm <- workflow() %>% add_recipe(rec) %>% add_model(m.lgbm)
res.lgbm <- fit_resamples(wf.lgbm,resamples = cv.folds)
Warning message:
All models failed. See the `.notes` column.
res.lgbm$.notes[[1]]
internal: Error in pkg_list[[1]]: subgroup out of bounds
【问题讨论】:
-
同样的问题,有什么提示吗?
标签: r lightgbm tidymodels