XGBoost 模型上的 GridSearchCV 给出错误答案

【问题标题】：GridSearchCV on XGBoost model gives errorXGBoost 模型上的 GridSearchCV 给出错误
【发布时间】：2018-03-27 04:02:44
【问题描述】：

我在 python 中做了一个 XGBoost 分类器。我试着做GridSearch 来找到像这样的最佳参数

grid_search = GridSearchCV(model, param_grid, scoring="neg_log_loss", n_jobs=-1, cv=kfold)
grid_result = grid_search.fit(X, Y)

print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']

for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

运行搜索时出现这样的错误

[Errno 28] No space left on device

我使用了一个稍大的数据集。在哪里， X.shape = (38932, 1002) Y.shape= (38932,)

有什么问题。？如何解决？

这是因为数据集对于我的机器来说太大了。？如果是这样，我该怎么做才能在此数据集上执行 GridSearch。？

【问题讨论】：

请通过提供样本和形状或数据链接来包含数据集的描述
我已经编辑了问题并添加了形状
您是否遇到了类似的问题：stackoverflow.com/a/6999259/1577947

标签： machine-learning scikit-learn xgboost grid-search

【解决方案1】：

错误表明共享内存已用完，可能增加 kfolds 的数量和/或调整使用的线程数，即 n_jobs 将解决此问题 .这是一个使用 xgboost 的工作示例

import xgboost as xgb
from sklearn.model_selection import GridSearchCV
from sklearn import datasets

clf = xgb.XGBClassifier()
parameters = {
    'n_estimators': [100, 250, 500],
    'max_depth': [6, 9, 12],
    'subsample': [0.9, 1.0],
    'colsample_bytree': [0.9, 1.0],
}
bsn = datasets.load_iris()
X, Y = bsn.data, bsn.target
grid = GridSearchCV(clf,
                    parameters, n_jobs=4,
                    scoring="neg_log_loss",
                    cv=3)

grid.fit(X, Y)
print("Best: %f using %s" % (grid.best_score_, grid.best_params_))

means = grid.cv_results_['mean_test_score']
stds = grid.cv_results_['std_test_score']
params = grid.cv_results_['params']

for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

输出是

Best: -0.121569 using {'colsample_bytree': 0.9, 'max_depth': 6, 'n_estimators': 100, 'subsample': 1.0}
-0.126334 (0.080193) with: {'colsample_bytree': 0.9, 'max_depth': 6, 'n_estimators': 100, 'subsample': 0.9}
-0.121569 (0.081561) with: {'colsample_bytree': 0.9, 'max_depth': 6, 'n_estimators': 100, 'subsample': 1.0}
-0.139359 (0.075462) with: {'colsample_bytree': 0.9, 'max_depth': 6, 'n_estimators': 250, 'subsample': 0.9}
-0.131887 (0.076174) with: {'colsample_bytree': 0.9, 'max_depth': 6, 'n_estimators': 250, 'subsample': 1.0}
-0.148302 (0.074890) with: {'colsample_bytree': 0.9, 'max_depth': 6, 'n_estimators': 500, 'subsample': 0.9}
-0.135973 (0.076167) with: {'colsample_bytree': 0.9, 'max_depth': 6, 'n_estimators': 500, 'subsample': 1.0}
-0.126334 (0.080193) with: {'colsample_bytree': 0.9, 'max_depth': 9, 'n_estimators': 100, 'subsample': 0.9}
-0.121569 (0.081561) with: {'colsample_bytree': 0.9, 'max_depth': 9, 'n_estimators': 100, 'subsample': 1.0}
-0.139359 (0.075462) with: {'colsample_bytree': 0.9, 'max_depth': 9, 'n_estimators': 250, 'subsample': 0.9}
-0.131887 (0.076174) with: {'colsample_bytree': 0.9, 'max_depth': 9, 'n_estimators': 250, 'subsample': 1.0}
-0.148302 (0.074890) with: {'colsample_bytree': 0.9, 'max_depth': 9, 'n_estimators': 500, 'subsample': 0.9}
-0.135973 (0.076167) with: {'colsample_bytree': 0.9, 'max_depth': 9, 'n_estimators': 500, 'subsample': 1.0}
-0.126334 (0.080193) with: {'colsample_bytree': 0.9, 'max_depth': 12, 'n_estimators': 100, 'subsample': 0.9}
-0.121569 (0.081561) with: {'colsample_bytree': 0.9, 'max_depth': 12, 'n_estimators': 100, 'subsample': 1.0}
-0.139359 (0.075462) with: {'colsample_bytree': 0.9, 'max_depth': 12, 'n_estimators': 250, 'subsample': 0.9}
-0.131887 (0.076174) with: {'colsample_bytree': 0.9, 'max_depth': 12, 'n_estimators': 250, 'subsample': 1.0}
-0.148302 (0.074890) with: {'colsample_bytree': 0.9, 'max_depth': 12, 'n_estimators': 500, 'subsample': 0.9}
-0.135973 (0.076167) with: {'colsample_bytree': 0.9, 'max_depth': 12, 'n_estimators': 500, 'subsample': 1.0}
-0.132745 (0.080433) with: {'colsample_bytree': 1.0, 'max_depth': 6, 'n_estimators': 100, 'subsample': 0.9}
-0.127030 (0.077692) with: {'colsample_bytree': 1.0, 'max_depth': 6, 'n_estimators': 100, 'subsample': 1.0}
-0.146143 (0.077623) with: {'colsample_bytree': 1.0, 'max_depth': 6, 'n_estimators': 250, 'subsample': 0.9}
-0.140400 (0.074645) with: {'colsample_bytree': 1.0, 'max_depth': 6, 'n_estimators': 250, 'subsample': 1.0}
-0.153624 (0.077594) with: {'colsample_bytree': 1.0, 'max_depth': 6, 'n_estimators': 500, 'subsample': 0.9}
-0.143833 (0.073645) with: {'colsample_bytree': 1.0, 'max_depth': 6, 'n_estimators': 500, 'subsample': 1.0}
-0.132745 (0.080433) with: {'colsample_bytree': 1.0, 'max_depth': 9, ...

【讨论】：

我之前已经在我的机器上成功运行过 CVSearch。我只面临这个数据集的问题。
我会在没有kFold 的情况下尝试并告诉你它是怎么回事
您可能还想打开网格搜索中的详细程度，即verbose = 5，以查看是否某些参数值导致了问题。
我现在使用cv=5 和verbose=3 运行它。我看到参数变化不大。