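Explanation of SHAP values from XGBoost

【Question title】: Explanation of SHAP values from XGBoost
【Posted】: 2021-11-12 02:50:21
【Question body】:

I have fitted an XGBoost model for binary classification. I am trying to understand the fitted model and am using SHAP to explain its predictions.

However, I am confused by the force plot generated by SHAP. I expected the output value to be less than 0, since the predicted probability is less than 0.5. Instead, the SHAP output value shows 8.12.

Below is the code that generates the results.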

import shap
import xgboost as xgb
import json
from scipy.sparse import load_npz

print('Version of SHAP: {}'.format(shap.__version__))
print('Version of XGBoost: {}'.format(xgb.__version__))

Version of SHAP: 0.39.0

Version of XGBoost: 1.4.1

# Read the data
X = load_npz('test_data.npz')
X_dmatrix = xgb.DMatrix(X)

# Read the selected features
with open('feature_list.json', 'r') as file:
    feature_list = json.load(file)
    
feature_names = [f'Feature {x:04d}' for x in range(len(feature_list))]

# Read the XGBoost model
xgb_model = xgb.Booster()
xgb_model.load_model('xgboost.json')

# Model prediction

model_pred_detail = xgb_model.predict(X_dmatrix, pred_contribs=True)
model_pred_prob = xgb_model.predict(X_dmatrix)
model_pred_detail.shape

(7887, 501)

# Randomly select a case
xid=4549
print('Predict proba: {:.04f}'.format(model_pred_prob[xid]))

Predict proba: 0.2292

# Doing SHAP way (https://github.com/slundberg/shap)
explainer = shap.Explainer(xgb_model, feature_names=feature_names, algorithm='tree')
shap_values = explainer(X.toarray())

shap.plots.force(shap_values[xid])

However, if I use the SHAP values computed by the XGBoost library itself, I get a different plot, which looks like what I expected.

shap.force_plot(
    model_pred_detail[xid, -1], # From XGBoost.Booster.predict with pred_contribs=True
    model_pred_detail[xid, 0:-1], # From XGBoost.Booster.predict with pred_contribs=True
    feature_names=feature_names, 
    features=X[xid].toarray()
)

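Why does this happen? Which one is the correct SHAP values to explain the XGBoost model?

Thanks for your help.

Follow-up to the reply from @sergey-bushmanov

Since I cannot share my own data, I reproduced the situation with an open dataset from Kaggle.

Here is the code for model training: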


import pandas as pd
import numpy as np
import re
from sklearn.feature_extraction.text import CountVectorizer
import xgboost as xgb
from sklearn.model_selection import train_test_split
import matplotlib.pylab as plt
from matplotlib import pyplot
import io
from scipy.sparse import save_npz


# parameter setting
class_weight = 10
minNgramLength = 1
maxNgramLength = 3
binary = False
min_df = 20

# Convert to fix the problem of encoding
with open('Corona_NLP_train.csv', 'rb') as file:
    csv_file = file.read()
csv_file2 = csv_file.decode('utf-8', errors='replace')

# Read and split data
df_note = pd.read_csv(io.StringIO(csv_file2), encoding='utf-8')
df_note['label'] = np.where(df_note['Sentiment'].str.contains('negative', flags=re.I), 0, 1)

df_train, df_test = train_test_split(df_note, test_size=0.2, random_state=42)

# Tokenization
vectorizer = CountVectorizer(max_df=0.98,
                             min_df=min_df,
                             binary=binary,
                            ngram_range=(minNgramLength, maxNgramLength))
vectorizer.fit(df_train['OriginalTweet'])
X_train = vectorizer.transform(df_train['OriginalTweet']).astype(float)
y_train = df_train['label'].astype(float).reset_index(drop=True)

last_params ={
 'lambda': 0.00016096144192346114,
 'alpha': 0.057770973181367063,
 'eta': 0.19258319097144733,
 'gamma': 0.40032424821976653,
 'max_depth': 9,
 'min_child_weight': 5,
 'subsample': 0.31304772813494836,
 'colsample_bytree': 0.4214452441229869,
 'objective': 'binary:logistic',
 'verbosity': 0,
 'n_estimators': 400
}

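# NOTE: w_train is not defined in the original post. Assumption: the positive
# class is up-weighted by class_weight; adjust to match your own weighting.
w_train = np.where(y_train == 1, class_weight, 1)

classifierCV = xgb.XGBClassifier(**last_params, importance_type='gain')
classifierCV.fit(X_train, y_train, sample_weight=w_train)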

# Get the features
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() there)
feature_names = vectorizer.get_feature_names()

# save model
classifierCV.get_booster().save_model('xgboost.json')

# Save features
import json

with open('feature_list.json', 'w') as file:
    file.write(json.dumps({y:x for x, y in enumerate(feature_names)}))

# save data
save_npz('test_data.npz', X_train)

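The problem persists with this model as well.

【Comments】:

Could you provide a fully reproducible, start-to-finish example, including data, that shows the "problem" you are facing? So far what I see is a compilation of 2 cases: one without data but with a problem, and another with data but without a clear statement of what needs to be solved.
Sorry for the late reply. I have put the full notebook here for your reference. Thanks.

Tags: python xgboost shap xgbclassifier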

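【Answer 1】:

Which one is the correct SHAP values to explain the XGBoost model?

Assuming you have a binary classification at hand, what you get in your second example is indeed the correct decomposition of the raw SHAP values. These values are in log-odds space, so passing the raw output (-1.21 here) through the logistic function gives the probability: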

In [1]: from scipy.special import expit
In [2]: expit(-1.21)
Out[2]: 0.22970105095339813

Note that 0.2297 is very close to what you got in your:

Predict proba: 0.2292
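The link between the two numbers can also be checked by hand. The sketch below uses only the standard library and assumes the model was trained with the `binary:logistic` objective, so that the raw output is a log-odds margin. The helper names `sigmoid` and `logit` are illustrative and are not part of SHAP or XGBoost.

```python
import math


def sigmoid(margin: float) -> float:
    """Map a raw log-odds margin to a probability."""
    return 1.0 / (1.0 + math.exp(-margin))


def logit(prob: float) -> float:
    """Map a probability back to a log-odds margin."""
    return math.log(prob / (1.0 - prob))


# A predicted probability of 0.2292 corresponds to a negative margin,
# in line with the -1.21 read from the second force plot.
print(round(logit(0.2292), 4))

# A raw output of 8.12 would instead imply a probability very close to 1,
# so it cannot describe the same prediction.
print(round(sigmoid(8.12), 4))
```

Any probability below 0.5 maps to a negative margin, which is why an output value of 8.12 is inconsistent with a predicted probability of 0.2292.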

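As for:

Why does this happen?

Most probably you have a typo somewhere, but to be sure you would have to provide a fully reproducible example, including your data, because code-wise both ways of calculating SHAP values are correct.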
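One way to narrow down where the two results diverge is to test each set of SHAP values for additivity: the per-feature contributions plus the bias should sum to the raw margin, and the logistic function of that margin should reproduce the predicted probability. The following is a minimal sketch of such a check, not a definitive diagnostic. The function name `check_contribs`, the tolerance, and the toy numbers are assumptions made for illustration; only the column layout (features first, bias last) follows what `Booster.predict(..., pred_contribs=True)` returns.

```python
import numpy as np


def check_contribs(contribs: np.ndarray, probs: np.ndarray, atol: float = 1e-3) -> bool:
    """Return True if each row of `contribs` sums to the log-odds of `probs`.

    `contribs` has one column per feature plus a final bias column.
    """
    margins = contribs.sum(axis=1)               # raw model output per row
    recovered = 1.0 / (1.0 + np.exp(-margins))   # logistic link
    return bool(np.allclose(recovered, probs, atol=atol))


# Toy stand-in for real model output (made-up numbers: two features + bias).
toy_contribs = np.array([[-0.9, 0.2, -0.51],
                         [1.5, 0.3, 0.20]])
toy_probs = 1.0 / (1.0 + np.exp(-toy_contribs.sum(axis=1)))
print(check_contribs(toy_contribs, toy_probs))
```

With the variables from the question this would be called as `check_contribs(model_pred_detail, model_pred_prob)`. The `shap.Explainer` result can be tested the same way by stacking `shap_values.values` with `shap_values.base_values` as the last column. Whichever set fails the check is the one computed on different inputs or a different model than the predictions.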

【Discussion】:

Thanks, Sergey. I cannot share my own dataset, so I reproduced the situation with an open dataset.