How to find optimum parameters for GBR using GridSearchCV?

【Question title】: How to find optimum parameters for GBR using GridSearchCV?
【Posted】: 2020-01-28 19:54:41
【Question】:

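I have 34 samples with 4 inputs and one output in an Excel file. I am using a gradient boosting regressor (GBR) for prediction, and I want to use scikit-learn's grid search with cross-validation to split the data and find the optimum parameters for the GBR. I implemented the code below to tune the GBR parameters, but I get the error shown below. The code was originally written for a classification problem using XGB, and I modified it to fit my regression problem. Could you please help me resolve this error? Am I doing this correctly?

The error I get: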

ValueError                                Traceback (most recent call last)
<ipython-input-5-4ee3b80c1f07> in <module>()
     23 kfold = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
     24 grid_search = GridSearchCV(model, param_grid, scoring="neg_log_loss", n_jobs=-1, cv=kfold,verbose=1)
---> 25 grid_result = grid_search.fit(X, label_encoded_y)
     26 # summarize results
     27 print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

D:\Anconda\lib\site-packages\sklearn\model_selection\_search.py in fit(self, X, y, groups, **fit_params)
    637                                   error_score=self.error_score)
    638           for parameters, (train, test) in product(candidate_params,
--> 639                                                    cv.split(X, y, groups)))
    640 
    641         # if one choose to see train score, "out" will contain train score info

D:\Anconda\lib\site-packages\sklearn\model_selection\_split.py in split(self, X, y, groups)
    330                                                              n_samples))
    331 
--> 332         for train, test in super(_BaseKFold, self).split(X, y, groups):
    333             yield train, test
    334 

D:\Anconda\lib\site-packages\sklearn\model_selection\_split.py in split(self, X, y, groups)
     93         X, y, groups = indexable(X, y, groups)
     94         indices = np.arange(_num_samples(X))
---> 95         for test_index in self._iter_test_masks(X, y, groups):
     96             train_index = indices[np.logical_not(test_index)]
     97             test_index = indices[test_index]

D:\Anconda\lib\site-packages\sklearn\model_selection\_split.py in _iter_test_masks(self, X, y, groups)
    632 
    633     def _iter_test_masks(self, X, y=None, groups=None):
--> 634         test_folds = self._make_test_folds(X, y)
    635         for i in range(self.n_splits):
    636             yield test_folds == i

D:\Anconda\lib\site-packages\sklearn\model_selection\_split.py in _make_test_folds(self, X, y)
    597             raise ValueError("n_splits=%d cannot be greater than the"
    598                              " number of members in each class."
--> 599                              % (self.n_splits))
    600         if self.n_splits > min_groups:
    601             warnings.warn(("The least populated class in y has only %d"

ValueError: n_splits=2 cannot be greater than the number of members in each class.

This is my attempt:

# XGB, Tune n_estimators and max_depth
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import LabelEncoder
import matplotlib

import pandas as pd
from sklearn.ensemble import ExtraTreesRegressor, GradientBoostingRegressor
from sklearn.feature_selection import SelectFromModel

from sklearn import preprocessing
from sklearn import model_selection
from sklearn.tree import DecisionTreeRegressor
from sklearn import ensemble
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error

from IPython.core.interactiveshell import InteractiveShell
matplotlib.use('Agg')
from matplotlib import pyplot
import numpy as np
#read data
Data_ini = pd.read_excel('Data - 1 output -Ra-in - Crossvalidation.xlsx').iloc[:,:]  #read data

#encode string class values as integers
label_encoded_y = LabelEncoder().fit_transform(y)
# grid search
model = GradientBoostingRegressor()
n_estimators = [50, 100, 150, 200]
max_depth = [2, 4, 6, 8]
print(max_depth)
param_grid = dict(max_depth=max_depth, n_estimators=n_estimators)
kfold = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
grid_search = GridSearchCV(model, param_grid, scoring="neg_log_loss", n_jobs=-1, cv=kfold,verbose=1)
grid_result = grid_search.fit(X, label_encoded_y)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))
# plot results
scores = np.array(means).reshape(len(max_depth), len(n_estimators))
for i, value in enumerate(max_depth):
    pyplot.plot(n_estimators, scores[i], label='depth: ' + str(value))
pyplot.legend()
pyplot.xlabel('n_estimators')
pyplot.ylabel('Log Loss')
pyplot.savefig('n_estimators_vs_max_depth.png')

【Question comments】:

Tags: python scikit-learn cross-validation sklearn-pandas grid-search

【Solution 1】:

You get this error because you are using StratifiedKFold for a regression problem. From its documentation:

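This cross-validation object is a variation of KFold that returns stratified folds. The folds are made by preserving the percentage of samples for each class.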
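
To make the quoted behaviour concrete, here is a minimal sketch of stratification on a classification target. The labels (6 samples of class 0, 4 of class 1) are made up for illustration and are not from the question's data.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical labels: 6 samples of class 0 and 4 samples of class 1.
y = np.array([0] * 6 + [1] * 4)
# The feature values do not affect how StratifiedKFold assigns folds.
X = np.zeros((len(y), 1))

skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)

# Count how many samples of each class land in each test fold.
test_class_counts = [
    np.bincount(y[test_idx], minlength=2).tolist()
    for _, test_idx in skf.split(X, y)
]
print(test_class_counts)
```

Each test fold keeps the 60/40 class ratio of the full label vector, which is only meaningful when the target consists of discrete classes.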

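The ValueError is raised when no class (in a regression problem, no distinct target value) has at least n_splits instances. With continuous targets almost every value is unique, so every "class" has a single member. You can reproduce the error with: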

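import numpy as np
from sklearn.model_selection import StratifiedKFold

x = np.linspace(1, 10, 10)
y = np.linspace(1, 10, 10)

kfold = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
# split() returns a generator, so consume it to trigger the ValueError
list(kfold.split(x, y))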

If you give one of the classes more instances, you no longer get this error:

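x = np.linspace(1, 10, 10)
y = np.linspace(1, 10, 10)
y[1] = 5  # the value 5 now occurs twice

kfold = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
# no ValueError here; scikit-learn only warns about the classes with a single member
list(kfold.split(x, y))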

To make your code work, simply replace StratifiedKFold with KFold.
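
As a minimal sketch of that replacement, the snippet below runs KFold on the same kind of all-unique target that made StratifiedKFold fail. The toy arrays mirror the reproduction above and are not the question's data.

```python
import numpy as np
from sklearn.model_selection import KFold

# Continuous targets: every value is unique, as is typical in regression.
x = np.linspace(1, 10, 10).reshape(-1, 1)
y = np.linspace(1, 10, 10)

kfold = KFold(n_splits=2, shuffle=True, random_state=0)

# split() is a generator, so materialize it to actually perform the splitting.
folds = list(kfold.split(x, y))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

KFold ignores the values in y when forming folds, so it does not matter that no target value repeats.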

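Edit

neg_log_loss requires predict_proba, which GradientBoostingRegressor does not implement, so it cannot be used as the scoring function here. Since you are training a regression model, use neg_mean_absolute_error or one of the other regression metrics listed in the scikit-learn documentation on scoring.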
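
Putting both fixes together, a grid search for the regressor might look like the sketch below. The data is a synthetic stand-in (34 samples, 4 inputs) because the original Excel file is not available; the reduced parameter grid and n_splits=5 are assumptions chosen for illustration, not values from the question.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic stand-in for the real data (assumption): 34 samples, 4 inputs, 1 output.
rng = np.random.default_rng(0)
X = rng.normal(size=(34, 4))
y = X @ np.array([1.5, -2.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=34)

# Small illustrative grid; widen it for a real search.
param_grid = {"n_estimators": [50, 100], "max_depth": [2, 4]}

# Plain KFold works for continuous targets; no label encoding is needed.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

grid_search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid,
    scoring="neg_mean_absolute_error",  # a regression metric instead of neg_log_loss
    cv=kfold,
)
grid_result = grid_search.fit(X, y)

# best_score_ is the negated mean absolute error, so values closer to 0 are better.
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
```

With only 34 samples the cross-validated scores can vary noticeably between splits, so it may be worth comparing results for a few values of random_state before settling on parameters.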

【Comments】:

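Following your comment I replaced StratifiedKFold with KFold, but now I get this new error: "AttributeError: 'GradientBoostingRegressor' object has no attribute 'predict_proba'"
@SH_IQ Ah, I forgot: because you are doing regression, you cannot use neg_log_loss to score the cross-validation. According to the sklearn documentation, neg_log_loss requires predict_proba, which, as the error suggests, is not implemented in GradientBoostingRegressor. For regression problems, use neg_mean_absolute_error instead and everything works.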
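
The missing method can be checked directly. This small sketch compares the regressor with its classifier counterpart, which is included only for contrast and is not part of the question's code.

```python
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

# neg_log_loss needs class probabilities, i.e. a predict_proba method.
has_proba_regressor = hasattr(GradientBoostingRegressor(), "predict_proba")
has_proba_classifier = hasattr(GradientBoostingClassifier(), "predict_proba")

print("regressor has predict_proba: ", has_proba_regressor)
print("classifier has predict_proba:", has_proba_classifier)
```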