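[Posted]: 2021-01-08 14:42:19
[Question]:
I have several lightgbm models in R, and I want to verify and extract the names of the variables used during fitting. This is straightforward with glm, but I cannot find a way to do it with lightgbm models (if it is possible at all, see here).
Here is a reproducible example to make things clearer:
I use the data that ships with the lightgbm package: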
library(lightgbm)
data(agaricus.train, package = "lightgbm")
train <- agaricus.train
I first fit a basic lgbm model:
# formatting the data
dtrain <- lgb.Dataset(train$data, label = train$label)
data(agaricus.test, package = "lightgbm")
test <- agaricus.test
dtest <- lgb.Dataset.create.valid(dtrain, test$data, label = test$label)
params <- list(objective = "regression", metric = "l2")
valids <- list(test = dtest)
# running the model
model_lgbm <- lgb.train(
    params = params
    , data = dtrain
    , nrounds = 10L
    , valids = valids
    , min_data = 1L
    , learning_rate = 1.0
    , early_stopping_rounds = 5L
)
Now I can do the same thing with a glm:
## preparing the data
dd <- data.frame(label = train$label, as(train$data, "matrix")[,1:10])
## making the model
model_glm <- glm(label ~ ., data=dd, family="binomial")
With glm, there are many ways to quickly find the variables used in the model. The most obvious one is:
variable.names(model_glm)
[1] "(Intercept)" "cap.shape.bell" "cap.shape.conical" "cap.shape.convex"
[5] "cap.shape.flat" "cap.shape.knobbed" "cap.shape.sunken" "cap.surface.fibrous"
[9] "cap.surface.grooves" "cap.surface.scaly"
This function is not implemented for lightgbm models:
variable.names(model_lgbm)
NULL
Digging into the model object with str does not help either:
str(model_lgbm)
Classes 'lgb.Booster', 'R6' <lgb.Booster>
Public:
add_valid: function (data, name)
best_iter: 3
best_score: 0
current_iter: function ()
dump_model: function (num_iteration = NULL, feature_importance_type = 0L)
eval: function (data, name, feval = NULL)
eval_train: function (feval = NULL)
eval_valid: function (feval = NULL)
finalize: function ()
initialize: function (params = list(), train_set = NULL, modelfile = NULL,
lower_bound: function ()
predict: function (data, start_iteration = NULL, num_iteration = NULL,
raw: NA
record_evals: list
reset_parameter: function (params, ...)
rollback_one_iter: function ()
save: function ()
save_model: function (filename, num_iteration = NULL, feature_importance_type = 0L)
save_model_to_string: function (num_iteration = NULL, feature_importance_type = 0L)
set_train_data_name: function (name)
to_predictor: function ()
update: function (train_set = NULL, fobj = NULL)
upper_bound: function ()
Private:
eval_names: l2
get_eval_info: function ()
handle: 8.19470876878865e-316
higher_better_inner_eval: FALSE
init_predictor: NULL
inner_eval: function (data_name, data_idx, feval = NULL)
inner_predict: function (idx)
is_predicted_cur_iter: list
name_train_set: training
name_valid_sets: list
num_class: 1
num_dataset: 2
predict_buffer: list
set_objective_to_none: FALSE
train_set: lgb.Dataset, R6
train_set_version: 1
valid_sets: list
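The only way I have found to get at the variable names is through the lgb.importance function, but that is not ideal: computing variable importance can be slow for large models, and I am not even sure it reports every variable:
lgb.importance(model_lgbm)$Feature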
[1] "odor=none" "stalk-root=club"
[3] "stalk-root=rooted" "spore-print-color=green"
[5] "odor=almond" "odor=anise"
[7] "bruises?=bruises" "stalk-surface-below-ring=scaly"
[9] "gill-size=broad" "cap-surface=grooves"
[11] "cap-shape=conical" "gill-color=brown"
[13] "cap-shape=bell" "cap-shape=flat"
[15] "cap-surface=scaly" "cap-color=white"
[17] "population=clustered"
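To check the doubt about whether lgb.importance reports every variable, one option is to compare its output with the column names of the training matrix. The sketch below assumes that lgb.importance lists only features that appear in at least one split, which is my understanding of how it works. The hyperparameters are illustrative and are passed through params rather than as extra arguments.

```r
library(lightgbm)

data(agaricus.train, package = "lightgbm")
train <- agaricus.train
dtrain <- lgb.Dataset(train$data, label = train$label)

# Small model in the same spirit as the one above (illustrative settings)
model_lgbm <- lgb.train(
    params = list(objective = "regression", learning_rate = 1.0, min_data = 1L)
    , data = dtrain
    , nrounds = 10L
    , verbose = -1L
)

# Every column that was handed to the training routine
all_features <- colnames(train$data)

# Features reported by the importance table
used_features <- lgb.importance(model_lgbm)$Feature

# Columns that were available to the model but are missing from the table
never_split <- setdiff(all_features, used_features)

length(all_features)
length(used_features)
head(never_split)
```

If never_split is non-empty, the importance table is a subset of the training columns, not the full list of variables the model was given.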
Is there a way to access just the names of the variables used in a lightgbm model? Thanks.
[Discussion]:
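One possible route, offered as a sketch and not as an official API: the str output above lists a save_model_to_string method, and the text model format LightGBM writes contains, as far as I know, a header line starting with feature_names= followed by the space-separated feature names. The helper extract_feature_names below is hypothetical (it is not part of lightgbm), and the toy_model string is hand-written for illustration; it is not output from a real model.

```r
# Hypothetical helper: pull the feature list out of a LightGBM text model
extract_feature_names <- function(model_string) {
    lines <- strsplit(model_string, "\n", fixed = TRUE)[[1]]
    hit <- grep("^feature_names=", lines, value = TRUE)
    if (length(hit) == 0L) {
        return(character(0))
    }
    # Strip only the leading key: feature names may themselves contain "="
    names_part <- sub("^feature_names=", "", hit[1])
    strsplit(names_part, " ", fixed = TRUE)[[1]]
}

# Toy header, hand-written for illustration (not from a real model)
toy_model <- paste(
    "tree",
    "version=v3",
    "feature_names=odor=none cap-shape=bell gill-size=broad",
    "feature_infos=[0:1] [0:1] [0:1]",
    sep = "\n"
)
extract_feature_names(toy_model)

# With the model from the question, the call would presumably be:
# extract_feature_names(model_lgbm$save_model_to_string())
```

Note that this returns every feature the training dataset contained, which matches what variable.names gives for glm, not only the features that ended up in a split. The dump_model method returns a JSON string that, to my knowledge, carries a feature_names field as well, so parsing it with a JSON package would be an alternative.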