Shap or Lime with TPOT classifier: answers

[Question title]: Shap or Lime with TPOT classifier
[Posted]: 2021-07-27 09:06:31
[Question]:

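How would you use shap, Lime, or any other model interpretability tool with a pipeline exported by TPOT? For example, here is some code from the shap library, but you cannot pass the TPOT pipeline to it directly. What would you pass there instead?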

explainer = shap.Explainer(model)
shap_values = explainer(X)

[Comments]:

Tags: python scikit-learn shap tpot

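[Answer 1]:

Solution 1:

The model object that results from TPOT's optimization process is a scikit-learn Pipeline. To explain it with SHAP, you need to point SHAP at the Pipeline's final estimator (the classifier/regressor step), and you need to transform your data with every preceding Pipeline transformer step (i.e. preprocessors or feature selectors) before feeding it to the SHAP explainer.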

import numpy as np
import pandas as pd
import shap
from sklearn.datasets import load_iris
from tpot import TPOTClassifier

#Let's use the Iris dataset

iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.Series(iris.target, name="target")  # pass a 1-D target rather than a single-column DataFrame

tpot = TPOTClassifier(generations=3, population_size=25, verbosity=3, random_state=42)
tpot.fit(X, y)

#Inspect resulting Pipeline. Great, 2 steps in the Pipeline: one selector and then the classifier.

tpot.fitted_pipeline_

Pipeline(steps=[('variancethreshold', VarianceThreshold(threshold=0.05)),
                ('logisticregression',
                 LogisticRegression(C=10.0, random_state=42))])

# Before feeding your data to the explainer, transform it with every Pipeline step that precedes the classifier step.
# Beware that in this case it's just one step, but there could be more.

selector = tpot.fitted_pipeline_.named_steps["variancethreshold"]
kept_columns = X.columns[selector.get_support(indices=True)]  # names of the features the selector kept
shap_df = pd.DataFrame(selector.transform(X), columns=kept_columns)

# Finally, instruct the SHAP explainer to use the classifier step with the transformed data

shap.initjs()
explainer = shap.KernelExplainer(tpot.fitted_pipeline_.named_steps["logisticregression"].predict_proba, shap_df)
shap_values = explainer.shap_values(shap_df)

#Plot summary
shap.summary_plot(shap_values, shap_df)

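Solution 2:

Apparently the scikit-learn Pipeline predict_proba() function does exactly what was just described in Solution 1 (i.e. it transforms the data and then applies predict_proba with the final estimator).

In that sense, this should work for you as well: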

import numpy as np
import pandas as pd
import shap
from sklearn.datasets import load_iris
from tpot import TPOTClassifier

iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.Series(iris.target, name="target")  # pass a 1-D target rather than a single-column DataFrame

tpot = TPOTClassifier(generations=10, population_size=50, verbosity=3, random_state=42, template='Selector-Transformer-Classifier')
tpot.fit(X, y)

#Resulting Pipeline
Pipeline(steps=[('variancethreshold', VarianceThreshold(threshold=0.0001)),
                ('rbfsampler', RBFSampler(gamma=0.8, random_state=42)),
                ('randomforestclassifier',
                 RandomForestClassifier(bootstrap=False, criterion='entropy',
                                        max_features=0.5, min_samples_leaf=10,
                                        min_samples_split=12,
                                        random_state=42))])

explainer = shap.KernelExplainer(tpot.fitted_pipeline_.predict_proba, X)
shap_values = explainer.shap_values(X)

shap.summary_plot(shap_values, X)
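
As a quick check of the claim that Pipeline.predict_proba performs the transformation itself, the sketch below compares both routes on a hand-built stand-in for the Selector-Transformer-Classifier pipeline (an assumption for illustration; it is not the TPOT output itself). It also hints at a practical difference between the two solutions: RBFSampler expands the four Iris features into 100 components by default, so explaining only the final estimator would attribute importance to those generated components, whereas explaining the whole pipeline keeps the attributions on the original features.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import VarianceThreshold
from sklearn.kernel_approximation import RBFSampler
from sklearn.pipeline import Pipeline

iris = load_iris()
X, y = iris.data, iris.target

# Assumption: a hand-built stand-in for the fitted TPOT pipeline.
pipeline = Pipeline(steps=[
    ("variancethreshold", VarianceThreshold(threshold=0.0001)),
    ("rbfsampler", RBFSampler(gamma=0.8, random_state=42)),
    ("randomforestclassifier", RandomForestClassifier(
        bootstrap=False, criterion="entropy", max_features=0.5,
        min_samples_leaf=10, min_samples_split=12, random_state=42)),
]).fit(X, y)

# Route 1: let the pipeline transform and predict in one call.
proba_pipeline = pipeline.predict_proba(X)

# Route 2: transform manually, then call the final estimator.
X_transformed = pipeline[:-1].transform(X)
proba_manual = pipeline[-1].predict_proba(X_transformed)

routes_agree = np.allclose(proba_pipeline, proba_manual)
print(X.shape, X_transformed.shape, routes_agree)
```

If the two routes agree, passing pipeline.predict_proba together with the untransformed X to the explainer, as in Solution 2, explains the same model in terms of the original columns.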

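Additional notes

If you are using a tree-based model, you can use TreeExplainer, which is much faster than the generic KernelExplainer. According to the documentation, it supports LightGBM, CatBoost, PySpark, and most tree-based scikit-learn models.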
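
One possible way to apply this note to a TPOT result is sketched below, under stated assumptions: explainer_kind and build_explainer are hypothetical helpers, and TREE_MODELS is an illustrative, non-exhaustive tuple rather than SHAP's official support list. Because TreeExplainer works on the tree model itself, the helper hands it the Pipeline's final estimator and returns the data transformed by the preceding steps, as in Solution 1.

```python
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Illustrative, non-exhaustive tuple (an assumption, not SHAP's official list).
TREE_MODELS = (DecisionTreeClassifier, RandomForestClassifier, ExtraTreesClassifier)


def explainer_kind(final_estimator):
    """Return "tree" for the listed tree-based models, otherwise "kernel"."""
    return "tree" if isinstance(final_estimator, TREE_MODELS) else "kernel"


def build_explainer(pipeline, background):
    """Hypothetical helper: return (explainer, background transformed for the final step)."""
    import shap  # imported lazily so the selection logic works without shap installed

    final_estimator = pipeline[-1]
    # Apply every step before the final estimator, if there is any.
    if len(pipeline) > 1:
        transformed = pipeline[:-1].transform(background)
    else:
        transformed = background

    if explainer_kind(final_estimator) == "tree":
        return shap.TreeExplainer(final_estimator), transformed
    return shap.KernelExplainer(final_estimator.predict_proba, transformed), transformed


print(explainer_kind(RandomForestClassifier()), explainer_kind(LogisticRegression()))
```

A call such as build_explainer(tpot.fitted_pipeline_, X) would then return an explainer whose shap_values method can be applied to the returned transformed data. The layout of the returned SHAP values for multi-class models differs between shap versions, so it is worth inspecting before plotting.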

[Comments]: