Why do the best score from gridsearch and the score from the model with the best parameters differ?

【Question title】: Why do the best score from gridsearch and the score from the model with the best parameters differ?
【Posted】: 2022-12-23 01:40:01
【Question description】:

I am using grid search with a predefined split. I want to choose the best hyperparameters for my model based on the MSE score on the validation dataset. Here is my code:

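import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import make_scorer
from sklearn.metrics import mean_squared_error as mse  # alias assumed from how mse is used below
from sklearn.model_selection import GridSearchCV, PredefinedSplit, train_test_split

data = pd.read_csv('data/concrete.csv').astype(float)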
X = data.drop('concrete_compressive_strength', axis=1)
y = data.concrete_compressive_strength
n = len(X)

X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=int(n*0.15), random_state=0xC0FFEE)
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, 
                                                  test_size=int(n*0.15), random_state=0xC0FFEE)
### BEGIN Solution (do not delete this comment)
validation_indexies = [0 if index in X_val.index else -1 for index in X_train_val.index]
validation_split = PredefinedSplit(validation_indexies)
score = make_scorer(mse)
rf_params = {'n_estimators' : np.linspace(100, 1000, num = 10).astype(int),
             'max_features': ['auto', 'sqrt'],
             'max_depth': np.linspace(10, 100, num = 10).astype(int)}

rf_regressor = GridSearchCV(estimator = RandomForestRegressor(random_state = 2022, n_jobs = -1), 
                          cv = validation_split, 
                          param_grid = rf_params, 
                          scoring = score, 
                          n_jobs = -1)

rf_regressor.fit(X_train_val, y_train_val)  # fit on train+val because the predefined split marks the validation rows
# Refit the model manually: GridSearchCV's refit would train on all of X_train_val,
# but the model should be trained on X_train only.
random_forest = RandomForestRegressor(**rf_regressor.best_params_, random_state = 2022, n_jobs = -1)
random_forest.fit(X_train, y_train)
print(f'Random forest best parameters: {rf_regressor.best_params_}')
print(f'Random forest MSE on validation: {mse(random_forest.predict(X_val), y_val)}')
print(f'Random forest MSE on train: {mse(random_forest.predict(X_train), y_train)}')
print(f'Random forest MSE on test: {mse(random_forest.predict(X_test), y_test)}')
print(f'Grid search best score {rf_regressor.best_score_}')
### END Solution (do not delete this comment)

This is the output:

Random forest best parameters: {'max_depth': 10, 'max_features': 'sqrt', 'n_estimators': 700}
Random forest MSE on validation: 23.70519021501106
Random forest MSE on train: 9.496448922692428
Random forest MSE on test: 29.05420154977391
Grid search best score 24.03263333882673

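My question is: why does the MSE of the random forest with the best parameters (the MSE on the validation dataset, on which I tuned the hyperparameters through grid search) differ from rf_regressor.best_score_?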

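【Question comments】:

Have you checked rf_regressor.cv_results_? The difference may come from the ordering of the samples, if it is not the same in the two training runs. (Unrelated to the question, but note that grid search tries to maximize its score, so you are getting the worst parameters rather than the best. Use scoring='neg_mean_squared_error' in the search, or greater_is_better=False in make_scorer.)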
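
The remark about the sign of the score can be checked with a minimal sketch. Everything here is an illustration rather than the asker's setup: synthetic data from make_regression stands in for the concrete dataset, the two-value max_depth grid is arbitrary, and run_search is a hypothetical helper.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import make_scorer, mean_squared_error
from sklearn.model_selection import GridSearchCV, PredefinedSplit

# Synthetic stand-in for the real data (assumption for illustration).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Rows marked -1 are always used for training; rows marked 0 form the validation fold.
test_fold = np.r_[np.full(150, -1), np.zeros(50, dtype=int)]
split = PredefinedSplit(test_fold)

param_grid = {"max_depth": [1, 8]}  # arbitrary small grid


def run_search(scoring):
    """Hypothetical helper: run the same search with a given scoring."""
    search = GridSearchCV(
        RandomForestRegressor(n_estimators=30, random_state=0),
        param_grid,
        scoring=scoring,
        cv=split,
        refit=False,
    )
    return search.fit(X, y)


# Plain MSE scorer: GridSearchCV maximizes it, so the largest error wins.
wrong = run_search(make_scorer(mean_squared_error))
# Negated MSE: maximizing it means minimizing the error.
right = run_search("neg_mean_squared_error")
also_right = run_search(make_scorer(mean_squared_error, greater_is_better=False))

print("plain MSE scorer picks:  ", wrong.best_params_, wrong.best_score_)
print("neg MSE scorer picks:    ", right.best_params_, right.best_score_)
print("greater_is_better=False: ", also_right.best_params_, also_right.best_score_)
```

With the negated scorers, best_score_ is reported as a negative number, so its sign has to be flipped before comparing it with a manually computed MSE.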

Tags: python scikit-learn grid-search train-test-split mse

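【Solution 1】:

The best score is the "mean cross-validated score of the best_estimator" found by the hyperparameter search. GridSearchCV tries each hyperparameter combination and selects the one with the highest score. The selection is based on the score on the held-out fold, not on the training score. The best estimator is therefore the one that scores highest on data it was not trained on; its training score plays no part in the choice. In other words, the search favors the configuration that generalizes best to the held-out data.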
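
With a single predefined fold, that mean is just the one validation score, so it can in principle be reproduced by hand. The sketch below is an illustration under assumptions (synthetic make_regression data, an arbitrary grid, a hypothetical helper val_mse), not the asker's exact pipeline. It suggests one plausible cause of a mismatch: a random forest with a fixed random_state still depends on the order of the training rows, because the bootstrap draws row positions. In the question, X_train comes from a second shuffled train_test_split, so its rows are likely in a different order from the training rows GridSearchCV takes out of X_train_val.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, PredefinedSplit

# Synthetic stand-in for the real data (assumption for illustration).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# -1 = training rows, 0 = the single validation fold.
test_fold = np.r_[np.full(150, -1), np.zeros(50, dtype=int)]
train_idx = np.flatnonzero(test_fold == -1)
val_idx = np.flatnonzero(test_fold == 0)

search = GridSearchCV(
    RandomForestRegressor(n_estimators=30, random_state=0),
    param_grid={"max_depth": [3, 8]},  # arbitrary small grid
    scoring="neg_mean_squared_error",
    cv=PredefinedSplit(test_fold),
    refit=False,
)
search.fit(X, y)


def val_mse(rows):
    """Hypothetical helper: refit with the best parameters on the given rows,
    in the given order, and return the MSE on the validation rows."""
    model = RandomForestRegressor(n_estimators=30, random_state=0, **search.best_params_)
    model.fit(X[rows], y[rows])
    return mean_squared_error(y[val_idx], model.predict(X[val_idx]))


# Same rows in the same order as the search used: matches -best_score_.
same_order = val_mse(train_idx)
# Same rows, shuffled: the bootstrap samples change, and so does the score.
shuffled = val_mse(np.random.default_rng(1).permutation(train_idx))

print("grid search MSE:      ", -search.best_score_)
print("manual, same order:   ", same_order)
print("manual, shuffled rows:", shuffled)
```

If this is the cause, refitting on the training rows selected in their X_train_val order (for example X_train_val[np.array(validation_indexies) == -1] with the matching y rows) should give a validation MSE equal to the value found in cv_results_.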

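The model's own score, on the other hand (what the score method of a regressor returns), is the proportion of the variance in the dependent variable (y) that is explained by the independent variables (x), i.e. the coefficient of determination R². The closer it is to 1, the better the model fits the data.

【Comments】: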
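
The difference between the two kinds of score can be made concrete with a short sketch, again on synthetic data chosen only for illustration. It shows that the score method of a regressor returns R², which is on a different scale from an MSE and is related to it by R² = 1 - MSE / Var(y) on the same evaluation set.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic stand-in for the real data (assumption for illustration).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, y_train = X[:150], y[:150]
X_val, y_val = X[150:], y[150:]

model = RandomForestRegressor(n_estimators=30, random_state=0).fit(X_train, y_train)
pred = model.predict(X_val)

r2_from_score = model.score(X_val, y_val)  # R^2, at most 1, higher is better
r2_direct = r2_score(y_val, pred)          # the same quantity via sklearn.metrics
val_mse = mean_squared_error(y_val, pred)  # in squared target units, lower is better

# R^2 expressed through the MSE and the (population) variance of y_val.
r2_from_mse = 1 - val_mse / np.var(y_val)

print("model.score:", r2_from_score)
print("r2_score:   ", r2_direct)
print("MSE:        ", val_mse)
print("1 - MSE/Var:", r2_from_mse)
```

So an R² value and an MSE value cannot be compared directly, but two MSE values computed on the same validation rows can.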