【Question title】: How to find optimum parameters for GBR using GridSearchCV?
【Posted】: 2020-01-28 19:54:41
【Question description】:

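I have 34 samples with 4 inputs and one output in an Excel file. I am using a gradient boosting regressor (GBR) for prediction, and I want to find the optimum parameters for the GBR with scikit-learn's grid search, using cross-validation to split the data. I implemented the code below to tune the GBR parameters, but I get the error shown below. The code was originally written for a classification problem using XGB, and I modified it to fit my regression problem. Could you help me resolve this error? Am I doing this correctly?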

The error I get:

ValueError                                Traceback (most recent call last)
<ipython-input-5-4ee3b80c1f07> in <module>()
     23 kfold = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
     24 grid_search = GridSearchCV(model, param_grid, scoring="neg_log_loss", n_jobs=-1, cv=kfold,verbose=1)
---> 25 grid_result = grid_search.fit(X, label_encoded_y)
     26 # summarize results
     27 print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

D:\Anconda\lib\site-packages\sklearn\model_selection\_search.py in fit(self, X, y, groups, **fit_params)
    637                                   error_score=self.error_score)
    638           for parameters, (train, test) in product(candidate_params,
--> 639                                                    cv.split(X, y, groups)))
    640 
    641         # if one choose to see train score, "out" will contain train score info

D:\Anconda\lib\site-packages\sklearn\model_selection\_split.py in split(self, X, y, groups)
    330                                                              n_samples))
    331 
--> 332         for train, test in super(_BaseKFold, self).split(X, y, groups):
    333             yield train, test
    334 

D:\Anconda\lib\site-packages\sklearn\model_selection\_split.py in split(self, X, y, groups)
     93         X, y, groups = indexable(X, y, groups)
     94         indices = np.arange(_num_samples(X))
---> 95         for test_index in self._iter_test_masks(X, y, groups):
     96             train_index = indices[np.logical_not(test_index)]
     97             test_index = indices[test_index]

D:\Anconda\lib\site-packages\sklearn\model_selection\_split.py in _iter_test_masks(self, X, y, groups)
    632 
    633     def _iter_test_masks(self, X, y=None, groups=None):
--> 634         test_folds = self._make_test_folds(X, y)
    635         for i in range(self.n_splits):
    636             yield test_folds == i

D:\Anconda\lib\site-packages\sklearn\model_selection\_split.py in _make_test_folds(self, X, y)
    597             raise ValueError("n_splits=%d cannot be greater than the"
    598                              " number of members in each class."
--> 599                              % (self.n_splits))
    600         if self.n_splits > min_groups:
    601             warnings.warn(("The least populated class in y has only %d"

ValueError: n_splits=2 cannot be greater than the number of members in each class.

Here is my attempt:

# XGB, Tune n_estimators and max_depth
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import LabelEncoder
import matplotlib

import pandas as pd
from sklearn.ensemble import ExtraTreesRegressor, GradientBoostingRegressor
from sklearn.feature_selection import SelectFromModel

from sklearn import preprocessing
from sklearn import model_selection
from sklearn.tree import DecisionTreeRegressor
from sklearn import ensemble
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error

from IPython.core.interactiveshell import InteractiveShell
matplotlib.use('Agg')
from matplotlib import pyplot
import numpy as np
#read data
Data_ini = pd.read_excel('Data - 1 output -Ra-in - Crossvalidation.xlsx').iloc[:,:]  #read data

#encode string class values as integers
label_encoded_y = LabelEncoder().fit_transform(y)
# grid search
model = GradientBoostingRegressor()
n_estimators = [50, 100, 150, 200]
max_depth = [2, 4, 6, 8]
print(max_depth)
param_grid = dict(max_depth=max_depth, n_estimators=n_estimators)
kfold = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
grid_search = GridSearchCV(model, param_grid, scoring="neg_log_loss", n_jobs=-1, cv=kfold,verbose=1)
grid_result = grid_search.fit(X, label_encoded_y)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))
# plot results
scores = np.array(means).reshape(len(max_depth), len(n_estimators))
for i, value in enumerate(max_depth):
    pyplot.plot(n_estimators, scores[i], label='depth: ' + str(value))
pyplot.legend()
pyplot.xlabel('n_estimators')
pyplot.ylabel('Log Loss')
pyplot.savefig('n_estimators_vs_max_depth.png') 

【Question discussion】:

    Tags: python scikit-learn cross-validation sklearn-pandas grid-search


    【Solution 1】:

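    You get this error because you are using StratifiedKFold for a regression problem. From its documentation:

    This cross-validation object is a variation of KFold that returns stratified folds. The folds are made by preserving the percentage of samples for each class.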
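
    To illustrate what "preserving the percentage of samples for each class" means, here is a minimal sketch on a made-up classification target. The arrays X_demo and y_demo are hypothetical and are not the asker's data.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical classification target: 6 samples of class 0, 4 of class 1.
X_demo = np.arange(10).reshape(-1, 1)
y_demo = np.array([0] * 6 + [1] * 4)

skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)

# Count how many samples of each class land in each test fold.
fold_counts = [np.bincount(y_demo[test_idx], minlength=2).tolist()
               for _, test_idx in skf.split(X_demo, y_demo)]
print(fold_counts)  # [[3, 2], [3, 2]]
```

    Each test fold keeps the 60/40 class ratio of the full target. That is only possible when every class has at least as many members as there are folds, which a continuous regression target usually does not.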

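    When no class (in a regression problem, each distinct target value) has more than one instance, you get this ValueError. You can reproduce the error with:
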
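    import numpy as np
    from sklearn.model_selection import StratifiedKFold

    x = np.linspace(1, 10, 10)
    y = np.linspace(1, 10, 10)  # every target value occurs exactly once

    kfold = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
    # split() returns a generator, so iterate over it to trigger the ValueError
    list(kfold.split(x, y))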
    

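    If you let one of the classes have more than one instance, you will not get this error (only a warning about the least populated class):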

    x = np.linspace(1, 10, 10)
    y = np.linspace(1, 10, 10)
    y[1] = 5  # the value 5 now occurs twice
    
    kfold = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
    list(kfold.split(x, y))  # iterate over the generator to actually build the folds
    

    To make your code work, simply replace StratifiedKFold with KFold.
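
    As a quick check, the same toy arrays that break StratifiedKFold split without complaint under plain KFold, because KFold ignores the target values when building folds. This is a sketch on the toy data only, not on the asker's Excel file.

```python
import numpy as np
from sklearn.model_selection import KFold

x = np.linspace(1, 10, 10)
y = np.linspace(1, 10, 10)  # every target value is unique

kfold = KFold(n_splits=2, shuffle=True, random_state=0)

# split() is a generator; list() forces the folds to be computed.
folds = list(kfold.split(x, y))
for train_idx, test_idx in folds:
    print(len(train_idx), len(test_idx))  # 5 5 for each of the two folds
```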

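    Edit

    neg_log_loss requires predict_proba, which GradientBoostingRegressor does not implement, so it cannot be used as the scoring function here. Since you are training a regression model, use neg_mean_absolute_error or one of the other regression metrics listed in the scikit-learn scoring documentation.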
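
    Putting both changes together, a grid search for the regressor might look like the sketch below. The data is synthetic (34 random samples with 4 inputs, standing in for the Excel file, whose contents are not shown in the question), and the parameter grid is shortened to keep the run fast; treat both as assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic stand-in for the Excel data: 34 samples, 4 inputs, 1 output.
rng = np.random.RandomState(0)
X = rng.rand(34, 4)
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + 0.1 * rng.randn(34)

model = GradientBoostingRegressor(random_state=0)

# A regressor predicts values, not class probabilities.
print(hasattr(model, "predict_proba"))  # False

# Shortened grid (an assumption) so the example runs quickly.
param_grid = {"n_estimators": [50, 100], "max_depth": [2, 4]}

# Plain KFold plus a regression scorer instead of StratifiedKFold + neg_log_loss.
kfold = KFold(n_splits=2, shuffle=True, random_state=0)
grid_search = GridSearchCV(model, param_grid,
                           scoring="neg_mean_absolute_error", cv=kfold)
grid_result = grid_search.fit(X, y)

# best_score_ is the negated MAE, so values closer to zero are better.
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
```

    With the real data, X and y would come from the loaded DataFrame instead, and the LabelEncoder step from the question would be dropped, since the target is a continuous value and not a class label.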

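    【Discussion】:

    • I replaced StratifiedKFold with KFold as you suggested, but now I get this new error: "AttributeError: 'GradientBoostingRegressor' object has no attribute 'predict_proba'"
    • @SH_IQ Ah, I forgot: since you are doing regression, you cannot use neg_log_loss to score the cross-validation. From this sklearn page you can see that neg_log_loss requires predict_proba, which, as the error suggests, is not implemented in GradientBoostingRegressor. For a regression problem, use neg_mean_absolute_error instead and everything works.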