【Question title】: Overcoming compatibility issues with using iml from h2o models
【Posted】: 2021-12-24 01:16:41
【Question】:

I cannot reproduce the only example I can find of using h2o with iml (https://www.r-bloggers.com/2018/08/iml-and-h2o-machine-learning-model-interpretability-and-feature-explanation/), as detailed here (Error when extracting variable importance with FeatureImp$new and H2O). Can anyone point me to a workaround, or to other examples of using iml with h2o?


Reproducible example:

library(rsample)   # data splitting
library(ggplot2)   # allows extension of visualizations
library(dplyr)     # basic data transformation
library(h2o)       # machine learning modeling
library(iml)       # ML interpretation
library(modeldata) # attrition data


# initialize h2o session
h2o.no_progress()
h2o.init()

# classification data
data("attrition", package = "modeldata")
df <- attrition %>%
  mutate_if(is.ordered, factor, ordered = FALSE) %>%
  mutate(Attrition = recode(Attrition, "Yes" = "1", "No" = "0") %>% factor(levels = c("1", "0")))

# convert to h2o object
df.h2o <- as.h2o(df)

# create train, validation, and test splits
set.seed(123)
splits <- h2o.splitFrame(df.h2o, ratios = c(.7, .15), destination_frames = 
    c("train","valid","test"))
names(splits) <- c("train","valid","test")

# variable names for response & features
y <- "Attrition"
x <- setdiff(names(df), y) 

# elastic net model 
glm <- h2o.glm(
  x = x, 
  y = y, 
  training_frame = splits$train,
  validation_frame = splits$valid,
  family = "binomial",
  seed = 123
  )

# 1. create a data frame with just the features
features <- as.data.frame(splits$valid) %>% select(-Attrition)

# 2. Create a vector with the actual responses
response <- as.numeric(as.vector(splits$valid$Attrition))

# 3. Create custom predict function that returns the predicted values as a
#    vector (probability of attrition in our example)
pred <- function(model, newdata)  {
  results <- as.data.frame(h2o.predict(model, as.h2o(newdata)))
  return(results[[3L]])
}

# create predictor object to pass to explainer functions
predictor.glm <- Predictor$new(
  model = glm, 
  data = features, 
  y = response, 
  predict.fun = pred,
  class = "classification"
  )

imp.glm <- FeatureImp$new(predictor.glm, loss = "mse")

Resulting error:

Error in `[.data.frame`(prediction, , self$class, drop = FALSE): undefined columns selected

traceback()

1. FeatureImp$new(predictor.glm, loss = "mse")

2. .subset2(public_bind_env, "initialize")(...)

3. private$run.prediction(private$sampler$X)

4. self$predictor$predict(data.frame(dataDesign))

5. prediction[, self$class, drop = FALSE]

6. `[.data.frame`(prediction, , self$class, drop = FALSE)

7. stop("undefined columns selected")

【Question comments】:

    Tags: machine-learning h2o iml dalex


    【Solution 1】:

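    The iml package documentation describes the class argument as "The class column to be returned." When you set class = "classification", iml looks for a column named "classification" in the prediction output and does not find one. Judging by GitHub at least, the iml package has gone through quite a bit of development since that blog post, so I imagine some features may no longer be backward compatible.

    After reading the package documentation, I think you might want to try something like the following: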
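The failing step in the traceback can be reproduced in base R without h2o or iml. This is a minimal sketch: the `prediction` data frame and its `p0`/`p1` columns are hypothetical stand-ins for whatever the predict function returns, not iml's internal object.

```r
# Hypothetical stand-in for the prediction output of a binary classifier.
prediction <- data.frame(p0 = c(0.8, 0.3), p1 = c(0.2, 0.7))

# Selecting a column that does not exist fails the same way as in the traceback.
msg <- tryCatch(
  prediction[, "classification", drop = FALSE],
  error = function(e) conditionMessage(e)
)
print(msg)

# Selecting a column that does exist works.
ok <- prediction[, "p1", drop = FALSE]
print(ok)
```

The point is that class is used as a column selector, so it has to name (or index) a column that is actually present in the prediction output.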

    predictor.glm <- Predictor$new(
      model = glm, 
      data = features, 
      y = "Attrition",
      predict.function = pred,
      type = "prob"
      )
    
    # check ability to predict first
    check <- predictor.glm$predict(features)
    print(check)
    

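    Alternatively, you may be better off using H2O's own extensive functionality for machine learning interpretability.

    h2o.varimp(glm) gives the variable importance of each feature.

    h2o.varimp_plot(glm, 10) renders a plot showing the relative importance of each feature.

    h2o.explain(glm, as.h2o(features)) is a wrapper around the explainability interface. By default it provides the confusion matrix (in this case) as well as variable importance and partial dependence plots for each feature.

    For certain algorithms (e.g., tree-based methods), h2o.shap_explain_row_plot() and h2o.shap_summary_plot() provide SHAP contributions.

    The h2o-3 docs may be useful for exploring further.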

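    【Comments】:

    • H2O-3 also supports permutation variable importance in recent versions (docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/…). One thing to add to this answer: use a loss other than "mse" for classification tasks (e.g., FeatureImp$new(predictor.glm, loss = "f1") worked for me).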
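To illustrate what permutation importance with a classification loss measures, here is a minimal base-R sketch. It is not iml's or H2O's implementation: the simulated data, the `ce_loss` and `perm_importance` helpers, and the choice to report the difference in loss rather than a ratio are all assumptions made for the illustration.

```r
set.seed(123)

# Simulated binary-classification data: x1 and x2 drive the outcome, noise does not.
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
dat$y <- rbinom(n, 1, plogis(2 * dat$x1 - dat$x2))

fit <- glm(y ~ x1 + x2 + noise, data = dat, family = binomial())

# Predicted probability of class 1, returned as a plain numeric vector.
predict_prob <- function(model, newdata) {
  as.numeric(predict(model, newdata = newdata, type = "response"))
}

# Classification error: share of rows whose thresholded prediction is wrong.
ce_loss <- function(actual, prob) mean(actual != as.integer(prob > 0.5))

baseline <- ce_loss(dat$y, predict_prob(fit, dat))

# Importance of a feature = mean loss after shuffling it, minus the baseline loss.
perm_importance <- function(feature, n_rep = 20) {
  losses <- replicate(n_rep, {
    shuffled <- dat
    shuffled[[feature]] <- sample(shuffled[[feature]])
    ce_loss(dat$y, predict_prob(fit, shuffled))
  })
  mean(losses) - baseline
}

feature_names <- c("x1", "x2", "noise")
importance <- vapply(feature_names, perm_importance, numeric(1))
print(sort(importance, decreasing = TRUE))
```

A loss defined on class labels, as here, matches how a classifier is usually judged, which is the motivation for preferring something other than "mse" in the classification case.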