Different results obtained on a regression model fitted on a database from different versions of XGBoost (answers)

【Question title】: Different results obtained on a regression model fitted on a database from different versions of XGBoost
【Posted】: 2021-12-02 10:56:48
【Question description】:

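I wrote Python code that uses XGBoost for a regression task, but when I run it on two different computers, with two different versions of XGBoost and Python, the results are drastically different. My code is long, so I will only show parts of it here: hyperparameter tuning with xgb.cv(), and fitting and predicting with the scikit-learn wrapper XGBRegressor using the parameters found by the tuning. The parameters to be tuned are stored in the following dictionary with initial values: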

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from sklearn import preprocessing
from model_functions import GaussRankScaler
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.model_selection import train_test_split
from math import sqrt
import xgboost as xgb
from xgboost import XGBRegressor
import shap
import operator
from sklearn.model_selection import GridSearchCV
import joblib
import plotly.graph_objs as go
import scipy as sp
import seaborn as sns
from numpy import asarray
from sklearn.svm import SVR
from scipy.stats import ttest_ind
from sklearn.impute import SimpleImputer
from sklearn import linear_model
import warnings
warnings.filterwarnings('ignore')
import scipy.stats as stats
from yellowbrick.regressor import residuals_plot, ResidualsPlot
import sys
from scipy.stats import pearsonr

params = {
    'max_depth': 6,
    'min_child_weight': 1,
    'learning_rate': 0.3,
    'subsample': 1,
    'colsample_bytree': 1,
    'objective': 'reg:squarederror',
    'eval_metric': 'rmse',
    'booster': 'gbtree',
    'nthread': -1,
    'validate_parameters':'True',
    'alpha': 0.2,
    'lambda': 0.001,
    'colsample_bylevel': 0.9,
    'verbose': 0,
    'gamma': 0.01,
    'max_delta_step': 0.1,
    'silent': 0
}

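The parameters are tuned with for loops as described below. Each loop tunes two parameters at a time, except for learning rate and gamma, which are each tuned on their own. At the end of each loop, the parameter dictionary is updated with the best values found. The loops are all alike; they differ only in which parameters they tune. xgb.cv() handles the cross-validation, and RMSE is the evaluation metric used to pick the best value of each parameter. Here is the loop that tunes the learning rate (a.k.a. eta):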

df_x = dfnum.iloc[:,:-1]

df_y = dfnum.iloc[:,-1]

X_train, X_test, y_train, y_test= train_test_split(df_x, df_y,
                                                   test_size=0.1,
                                                   random_state=42)    

"Converting features' distributions to normal distribution"

gauss_scaler = GaussRankScaler()

X_trainnum = gauss_scaler.fit_transform(X_train)

X_testnum = gauss_scaler.transform(X_test)

"Scaling all the features to be between 0 and 1"    

scaler = preprocessing.MinMaxScaler()
    
X_trainnum = scaler.fit_transform(X_trainnum)

X_testnum = scaler.transform(X_testnum)


num_boost_round = 999
    
dtrain = xgb.DMatrix(X_trainnum, label=y_train)
dtest = xgb.DMatrix(X_testnum, label=y_test)

min_rmse = float("Inf")
best_params = None

for learning_rate in [.3, .2, .1, .05, .01, .005]:

    params['learning_rate'] = learning_rate
    
    cv_results = xgb.cv(
            params,
            dtrain,
            num_boost_round=num_boost_round,
            seed=42,
            nfold=3,
            metrics=['rmse'],
            early_stopping_rounds=10
          )

    mean_rmse = cv_results['test-rmse-mean'].min()
    boost_rounds = cv_results['test-rmse-mean'].argmin()

    if mean_rmse < min_rmse:
        min_rmse = mean_rmse
        best_params = learning_rate

print('')
print("Best parameter: learning_rate = {}, RMSE: {}".format(best_params, min_rmse))
print('')

params['learning_rate'] = best_params
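
The question says the other loops tune two parameters at a time but does not show them. A minimal sketch of what such a loop could look like follows. The names `tune_pair`, `make_cv_scorer`, and `score_fn` are hypothetical helpers introduced here, not part of XGBoost, and pairing max_depth with min_child_weight is only an illustration. The scorer mirrors the xgb.cv() settings shown above and also returns the round count at the best CV score, which the loop above computes but never uses.

```python
from itertools import product


def tune_pair(params, name_a, values_a, name_b, values_b, score_fn):
    """Try every (a, b) combination and keep the one with the lowest RMSE.

    score_fn is a hypothetical callable: it takes a params dict and
    returns (rmse, n_rounds).
    """
    best = {"rmse": float("inf"), "rounds": None, name_a: None, name_b: None}
    for a, b in product(values_a, values_b):
        # Work on a copy so the caller's dict is only updated explicitly.
        trial = dict(params, **{name_a: a, name_b: b})
        rmse, rounds = score_fn(trial)
        if rmse < best["rmse"]:
            best = {"rmse": rmse, "rounds": rounds, name_a: a, name_b: b}
    return best


def make_cv_scorer(dtrain, num_boost_round=999, seed=42):
    """Build a score_fn backed by xgb.cv, using the settings from the question."""
    import xgboost as xgb  # imported lazily; tune_pair itself does not need it

    def score(trial_params):
        cv_results = xgb.cv(
            trial_params,
            dtrain,
            num_boost_round=num_boost_round,
            seed=seed,
            nfold=3,
            metrics=["rmse"],
            early_stopping_rounds=10,
        )
        test_rmse = cv_results["test-rmse-mean"]
        # argmin() is a 0-based position, so the round count is position + 1.
        return float(test_rmse.min()), int(test_rmse.argmin()) + 1

    return score
```

With the `dtrain` and `params` from the question, a call might look like `best = tune_pair(params, 'max_depth', range(3, 10), 'min_child_weight', range(1, 6), make_cv_scorer(dtrain))`, followed by copying the two winning values back into `params`.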

After all the parameters have been tuned this way, the updated parameter dictionary is passed to XGBRegressor and the model is fitted to the data at hand:

print('Fitting the model')

best_model = XGBRegressor(**params,early_stopping_rounds=10,num_boost_round=999)

best_model.fit(X_trainnum, y_train)

joblib.dump(best_model,'best_model_grid')

y_pred = best_model.predict(X_testnum)

y_pred1 = best_model.predict(X_trainnum)

I use Python and XGBoost through Anaconda on both of my machines (a personal laptop and an office PC). The laptop has XGBoost 0.90 and Python 3.7.10. The office PC runs Python 3.8.11 and XGBoost 1.42.
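
When comparing two machines like this, it can help to record the versions programmatically next to the results. The following is a small sketch, assuming a hypothetical helper named `environment_report` and an illustrative package list; on Python 3.7, where `importlib.metadata` is not available, it reports "unknown" instead of a version.

```python
import sys


def environment_report(packages=("xgboost", "scikit-learn", "numpy", "pandas")):
    """Return the Python version and the installed version of each package."""
    report = {"python": sys.version.split()[0]}
    try:
        from importlib import metadata  # standard library from Python 3.8
    except ImportError:
        metadata = None
    for name in packages:
        if metadata is None:
            report[name] = "unknown"
            continue
        try:
            report[name] = metadata.version(name)
        except Exception:
            # Raised when the distribution is not installed in this environment.
            report[name] = "not installed"
    return report


print(environment_report())
```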

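On my laptop, with the older versions of Python and XGBoost, the code runs smoothly without any warnings or errors. On my office PC, with the newer versions, I get the following message at the lines containing the xgb.cv() command used for hyperparameter tuning: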

[13:38:44] WARNING: D:\bld\xgboost-split_1631904903843\work\src\learner.cc:573: 
Parameters: { "early_stoppage" } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.

The message then changes to:

Hyperparameter tuning.
[13:38:47] WARNING: D:\bld\xgboost-split_1631904903843\work\src\learner.cc:573: 
Parameters: { "silent", "verbose" } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.

Finally, when the model is fitted with XGBRegressor, it becomes:

[15:08:00] WARNING: D:\bld\xgboost-split_1631904903843\work\src\learner.cc:573: 
Parameters: { "early_stopping_rounds", "num_boost_round", "silent", "verbose" } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.

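The results obtained with the older versions of the algorithm and the language are far better than those obtained with the newer versions, and the difference is very significant. The database I use consists of 11 numerical features and one numerical target.

I have researched and browsed this site and other sources, and asked many data analysis experts for help, but unfortunately I could not find a solution or the cause of this problem.

I would be very grateful if anyone could help me with this issue.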

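【Comments】:

The warnings tell you that these parameters, kept from the old version, are not used in the new version, which is why you get warnings rather than errors. And because those parameters are not used, the code does not do what you intend in the new version. You can compare the old and new source code to confirm this.
@lazy I have removed those parameters. My model now performs better, but it overfits. I suspect this is because there is no "early stopping" or "number of boosting rounds", and I don't know how to implement them in the new version of xgb.train().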
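
As a rough sketch of what the comments describe, the keys named in the warnings can be separated from the booster parameters, and the round count and early stopping can be passed to xgb.train() as arguments instead of dictionary entries. `split_flagged` and `train_with_early_stopping` are hypothetical helper names, and the held-out `dvalid` DMatrix is an assumption, since the question only builds `dtrain` and `dtest`.

```python
# Keys that the warnings in the question report as "might not be used".
FLAGGED_KEYS = {"silent", "verbose", "early_stopping_rounds", "num_boost_round"}


def split_flagged(params, flagged=FLAGGED_KEYS):
    """Split a params dict into (kept booster params, flagged entries)."""
    kept = {k: v for k, v in params.items() if k not in flagged}
    dropped = {k: v for k, v in params.items() if k in flagged}
    return kept, dropped


def train_with_early_stopping(params, dtrain, dvalid,
                              num_boost_round=999, patience=10):
    """Train with xgb.train, stopping when the validation metric stalls."""
    import xgboost as xgb  # imported lazily; split_flagged does not need it

    kept, _ = split_flagged(params)
    return xgb.train(
        kept,
        dtrain,
        num_boost_round=num_boost_round,   # an argument, not a params entry
        evals=[(dvalid, "validation")],    # early stopping watches this set
        early_stopping_rounds=patience,
        verbose_eval=False,
    )


params = {"max_depth": 6, "learning_rate": 0.3, "verbose": 0, "silent": 0}
booster_params, dropped = split_flagged(params)
print(booster_params)
print(sorted(dropped))
```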

Tags: python xgboost hyperparameters

【Solution 1】:

I will focus on this line of code:

best_model = XGBRegressor(**params,early_stopping_rounds=10,num_boost_round=999)

The correct version should be:

# Removed `verbose`, `silent`, and `eval_metric`.  Replaced `nthread` with `n_jobs`.
# Kept the objective as "reg:squarederror" since this is regression, not classification.
params = {
    'max_depth': 6,
    'min_child_weight': 1,
    'eta': 0.3,
    'subsample': 1,
    'colsample_bytree': 1,
    "objective": "reg:squarederror",
    'booster': 'gbtree',
    'n_jobs': 10,
    'validate_parameters':'True',
    'alpha': 0.2,
    'lambda': 0.001,
    'colsample_bylevel': 0.9,
    'gamma': 0.01,
    'max_delta_step': 0.1,
}

# notice the `n_estimators`
model = XGBRegressor(**params, n_estimators=999)

# Passed `early_stopping_rounds`, `verbose`, `eval_metric` here.
# Kept `eval_metric` as `rmse` since this is regression, not classification.
# Added `eval_set` since early stopping needs a set to evaluate on.
# Note: `eval_set=[(X, y)]` evaluates on the training data; a held-out set is usually preferable.
model.fit(
    X,
    y,
    early_stopping_rounds=10,
    verbose=True,
    eval_metric="rmse",
    eval_set=[(X, y)],
)

You can find the documentation for the estimators here: https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn
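
One possible variation on the code above is sketched below. It is an illustration under assumptions, not the answerer's method. It renames the native parameter names to the ones the scikit-learn wrapper documents (`learning_rate`, `reg_alpha`, `reg_lambda`, `n_jobs`) and stops early on a held-out validation split instead of the training data. `to_sklearn_names` and `fit_with_validation` are hypothetical helpers. As far as I know, newer XGBoost releases accept `early_stopping_rounds` in the constructor or via `set_params`, and recent 2.x releases no longer accept it in fit(), so the sketch falls back to `set_params` on a TypeError.

```python
# Native XGBoost names and their scikit-learn wrapper equivalents.
RENAMES = {
    "eta": "learning_rate",
    "alpha": "reg_alpha",
    "lambda": "reg_lambda",
    "nthread": "n_jobs",
}


def to_sklearn_names(params):
    """Return a copy of params with native names mapped to wrapper names."""
    return {RENAMES.get(k, k): v for k, v in params.items()}


def fit_with_validation(params, X, y, n_estimators=999, patience=10):
    """Fit XGBRegressor with early stopping on a held-out 20% split."""
    from sklearn.model_selection import train_test_split
    from xgboost import XGBRegressor

    X_fit, X_val, y_fit, y_val = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    model = XGBRegressor(**to_sklearn_names(params), n_estimators=n_estimators)
    try:
        # Style used in the answer above: early stopping passed to fit().
        model.fit(X_fit, y_fit, eval_set=[(X_val, y_val)],
                  early_stopping_rounds=patience, verbose=False)
    except TypeError:
        # Assumed fallback for releases where fit() rejects the argument.
        model.set_params(early_stopping_rounds=patience)
        model.fit(X_fit, y_fit, eval_set=[(X_val, y_val)], verbose=False)
    return model
```

Calling `fit_with_validation(params, X_trainnum, y_train)` with the question's arrays would then leave `X_testnum` untouched for the final evaluation.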

【Discussion】: