【Question title】: Why does LightGBM regression give zero SHAP mean values?
【Posted】: 2021-04-15 10:47:23
【Question】:

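As you can see in the SHAP waterfall plot, the values are all zero. What is the reason for this? Are zero values reasonable?

Here is a link to my data: https://github.com/kilickursat/Tunnelling/blob/main/TBM_Performance.xlsx

Here is my code: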

import numpy as np
import pandas as pd
import lightgbm
from sklearn.metrics import r2_score, mean_squared_error as MSE
from lightgbm import LGBMRegressor
import shap
import io

df2 = pd.read_excel(io.BytesIO(uploaded['TBM_Performance.xlsx'])) #Colab used
df2["ROCK_PRO"] = df2["UCS(MPa)"] / df2["BTS(MPa)"]
X = df2[["UCS(MPa)", "BTS(MPa)","Fs(m)","Alpha(degree)","PI(kN/mm)","ROCK_PRO"]]
y = df2[["ROP(m/hr)"]]
print(df2)
print(X,y)

hyper_params = {
    'task': 'train',
    'boosting_type': 'goss',
    'objective': 'regression',
    'metric': "mse"
}


# train a LightGBM model
model = lightgbm.LGBMRegressor(**hyper_params).fit(X, y)
explainer = shap.Explainer(model)
# compute the SHAP values used by the plot below
shap_values = explainer(X)

# visualize the first prediction's explanation
shap.plots.waterfall(shap_values[0])
[Image: SHAP waterfall plot for the first prediction]


from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

X = pd.DataFrame(np.c_[df2['PI(kN/mm)'],df2["ROCK_PRO"],df2["BTS(MPa)"]], columns = ['PI(kN/mm)', "ROCK_PRO", "BTS(MPa)"])
y = df2['ROP(m/hr)']
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.20, random_state=42)


model = LGBMRegressor(
    **hyper_params,
    min_data_in_leaf=0,
    min_sum_hessian_in_leaf=0.0,
).fit(X_train, y_train)

predictions = model.predict(X_test)
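# caution: scikit-learn's signature is r2_score(y_true, y_pred); this call passes them in the reverse order
r2_score(predictions, y_test).round(2)
# R2_score: 0.96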

【Comments】:

  • Thanks for the edit, @Flavia Giammarino.

Tags: python lightgbm shap


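【Solution 1】:

The SHAP values are all zero because your model returns a constant prediction: every sample ends up in the same leaf. This happens because your dataset has only 18 samples, while by default LightGBM requires at least 20 samples in each leaf (min_data_in_leaf defaults to 20), so no split is ever made. If you set min_data_in_leaf to a smaller value, such as 3, the model returns different predictions for different samples and the SHAP values are no longer zero.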

import pandas as pd
from lightgbm import LGBMRegressor
import shap

# import the data
df = pd.read_excel('TBM_Performance.xlsx') 
df['ROCK_PRO'] = df['UCS(MPa)'] / df['BTS(MPa)']
print(df.shape[0])
# 18

# extract the features and target
X = df[['UCS(MPa)', 'BTS(MPa)', 'Fs(m)', 'Alpha(degree)', 'PI(kN/mm)', 'ROCK_PRO']]
y = df[['ROP(m/hr)']]

# train the model with the default min_data_in_leaf=20
hyper_params = {
    'task': 'train',
    'boosting_type': 'goss',
    'objective': 'regression',
    'metric': 'mse',
}

model = LGBMRegressor(**hyper_params).fit(X, y)
print(model.predict(X))
# [2.52277776 2.52277776 2.52277776 2.52277776 2.52277776 2.52277776
#  2.52277776 2.52277776 2.52277776 2.52277776 2.52277776 2.52277776
#  2.52277776 2.52277776 2.52277776 2.52277776 2.52277776 2.52277776]

# train the model with min_data_in_leaf=3
hyper_params = {
    'task': 'train',
    'boosting_type': 'goss',
    'objective': 'regression',
    'metric': 'mse',
    'min_data_in_leaf': 3,
}

model = LGBMRegressor(**hyper_params).fit(X, y)
print(model.predict(X))
# [2.21428748 2.21428748 2.21428748 2.68171691 2.36794282 2.37986215
#  2.37986215 2.77942405 2.84938042 2.84938042 2.8104722  2.8104722
#  2.50056257 2.47946274 2.46754341 2.58446466 2.58446466 2.24212594]

explainer = shap.Explainer(model)
shap_values = explainer(X)
shap.plots.waterfall(shap_values[0])
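
The reasoning above can also be checked without LightGBM or the spreadsheet. The following is only a minimal pure-Python sketch under stated assumptions: `has_valid_split` is a hypothetical helper that captures just the leaf-size constraint (LightGBM applies further checks, such as min_sum_hessian_in_leaf), `shapley_values` is a brute-force Shapley computation written for illustration rather than the algorithm the shap library uses for trees, and the toy numbers are made up. It shows that 18 rows cannot be split when each leaf needs 20, and that a model with a constant output gets a zero attribution for every feature.

```python
from itertools import combinations
from math import factorial


def has_valid_split(n_samples, min_data_in_leaf):
    """Hypothetical helper: a split creates two leaves, and each one
    needs at least min_data_in_leaf rows."""
    return n_samples >= 2 * min_data_in_leaf


def shapley_values(predict, x, background):
    """Exact Shapley values of predict at x (brute force, illustration only).

    Features outside a coalition are filled in from each background row
    and the predictions are averaged.
    """
    n = len(x)

    def value(subset):
        # Features in `subset` come from x; the rest come from the background row.
        total = 0.0
        for row in background:
            mixed = [x[i] if i in subset else row[i] for i in range(n)]
            total += predict(mixed)
        return total / len(background)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        contribution = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                contribution += weight * (value(s | {i}) - value(s))
        phi.append(contribution)
    return phi


# Toy background data (made-up numbers, not the TBM dataset).
background = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
x = background[0]


def constant_model(row):
    # A tree that never splits predicts one constant for every row.
    return 2.5


def linear_model(row):
    # A hypothetical model that does depend on its inputs, for contrast.
    return 2.0 * row[0] + 0.1 * row[1]


print(has_valid_split(18, 20))  # 18 rows with the default min_data_in_leaf
print(has_valid_split(18, 3))   # 18 rows with min_data_in_leaf=3
print(shapley_values(constant_model, x, background))
print(shapley_values(linear_model, x, background))
```

For the constant model, every coalition has the same value, so each marginal contribution is zero. That is the situation behind the all-zero waterfall plot. For the model that depends on its inputs, the attributions are non-zero and sum to the difference between the prediction at x and the average prediction over the background rows.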

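【Comments】:

  • I tried it with 3 and the SHAP plot is fine. However, the r2_score is -0.4. When I use min_data_in_leaf=0 and min_sum_hessian_in_leaf=0.0, the r2_score is 0.96. That is my next point of confusion.
  • With your edited code, the R2 score is -0.4. Why is that? Could you reply?
  • That is a different question; you should ask it separately.
  • Thanks, Flavia, I will ask a new question.
  • Hi Flavia, I have asked my question in a new post; please take a look.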