GridSearchCV and the tree classifier: answers

【Question title】: GridSearchCV and the tree classifier
【Posted】: 2020-09-21 17:45:14
【Question】:

This post mentions the following:

  param_grid = {'max_depth': np.arange(3, 10)}
  tree = GridSearchCV(DecisionTreeClassifier(), param_grid)
  tree.fit(xtrain, ytrain)
  tree_preds = tree.predict_proba(xtest)[:, 1]
  tree_performance = roc_auc_score(ytest, tree_preds)

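Q1: Once we run the steps above and obtain the best parameters, do we need to fit a tree on all of the data (training + validation) using the learned parameters?

Q2: The parameter grid only names max_depth, which can be read from tree.best_params_. What about the other parameters the grid search ended up with? How can I access them to build a good tree?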

【Question comments】:

Did the solution work?
Please see below.

Tags: python scikit-learn tree classification gridsearchcv

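【Solution 1】:

To answer your first question: when you create the GridSearchCV object, you can set the refit parameter to True (this is the default). Once the search finishes, it refits an estimator with the best parameters found, using the whole dataset that was passed to fit, and exposes it through the best_estimator_ attribute. That estimator behaves like any other sklearn estimator and supports the .predict method. Calling predict or predict_proba on the GridSearchCV object itself delegates to this refit estimator.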
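The refit behavior described above can be sketched as follows. This is only an illustration under assumptions that are not in the original post: the data is synthetic (make_classification), the train/test split and its sizes are hypothetical, and random_state is fixed purely for repeatability.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic data and a 75/25 split, chosen only for illustration.
X, y = make_classification(n_samples=200, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

param_grid = {"max_depth": list(range(3, 10))}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X_train, y_train)

# With refit=True (the default), best_estimator_ has already been refit on
# all of X_train, so the search object and best_estimator_ predict the same.
same_predictions = (
    search.predict(X_test) == search.best_estimator_.predict(X_test)
).all()

# Optional extra step: copy the best hyperparameters into a fresh, unfitted
# estimator and fit it on every available sample (train + test).
final_tree = clone(search.best_estimator_).fit(X, y)
```

Whether to take the optional last step is a judgment call: fitting on all data uses more samples, but it leaves no held-out set for estimating the final model's performance.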

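As for your second question: every parameter of the decision tree that was used to fit the final estimator is available from the best_estimator_ attribute itself (for example through its get_params() method). As I said above, though, you do not need to fit a new classifier with the best parameters, because refit=True already does that for you.

The example code below illustrates this: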

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(random_state=0)
param_grid = {'max_depth': np.arange(3, 10), 'min_samples_leaf': np.arange(2, 10)}
tree = GridSearchCV(DecisionTreeClassifier(), param_grid)
tree.fit(X, y)
GridSearchCV(cv=None, error_score=nan,
             estimator=DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features=None,
                                              max_leaf_nodes=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              presort='deprecated',
                                              random_state=None,
                                              splitter='best'),
             iid='deprecated', n_jobs=None,
             param_grid={'max_depth': array([3, 4, 5, 6, 7, 8, 9]),
                         'min_samples_leaf': array([2, 3, 4, 5, 6, 7, 8, 9])},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)

# This is what your best estimator looks like
print(tree.best_estimator_)
DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=3, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=6, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')

# You can use it directly for prediction, as shown below
tree.best_estimator_.predict(X)
array([0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1,
       0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0,
       0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1,
       1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1,
       0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0])
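
For the second question, the difference between the searched parameters and the full parameter set can be sketched like this. It reuses the same grid as the example above; the fixed random_state on the tree is an added assumption for repeatability, and no output is shown because the selected values can vary with the scikit-learn version.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(random_state=0)
param_grid = {"max_depth": np.arange(3, 10), "min_samples_leaf": np.arange(2, 10)}
# Assumption: random_state is fixed here only to make runs repeatable.
tree = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid)
tree.fit(X, y)

# Only the parameters listed in param_grid are searched, so only they
# appear in best_params_.
best_params = tree.best_params_

# get_params() on the refit estimator returns every constructor parameter:
# the searched ones plus the defaults that the grid never touched.
all_params = tree.best_estimator_.get_params()

print(best_params)
print(all_params["criterion"], all_params["min_samples_split"])
print(tree.best_score_)  # mean cross-validated score of the best combination
```

Parameters that are not in param_grid are not tuned at all; they simply keep whatever value was passed to DecisionTreeClassifier(), which is usually the default.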

Hope this helps!

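【Comments】:

Thanks. In your example you only used X. What if you had used X_train and X_valid? My question is: if you train only on X_train, should you go back and train again on X using the best parameters from the grid search?
Unfortunately, the code above throws an error: TypeError: __init__() got an unexpected keyword argument 'ccp_alpha'. I updated my anaconda a few days ago, so I am not sure what the problem is.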
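
One plausible cause of the error in the last comment is a version mismatch: the ccp_alpha parameter was added to DecisionTreeClassifier in scikit-learn 0.22, so an older install (or a mix of old and new installs in one environment) would reject it. This is a guess, not a confirmed diagnosis. A minimal check of the installed version might look like:

```python
import inspect

import sklearn
from sklearn.tree import DecisionTreeClassifier

# Show which scikit-learn version this interpreter actually imports.
print(sklearn.__version__)

# Check whether this version's DecisionTreeClassifier accepts ccp_alpha.
init_params = inspect.signature(DecisionTreeClassifier.__init__).parameters
supports_ccp_alpha = "ccp_alpha" in init_params
print(supports_ccp_alpha)
```

If the second line prints False, upgrading scikit-learn in the active environment would be the first thing to try.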