Gridsearch technique in sklearn, Python: answers

【Question title】: Gridsearch technique in sklearn, python
【Posted】: 2017-09-09 20:51:06
【Question description】:

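I am working on a supervised machine learning algorithm, and it seems to behave strangely. So, let me start:

I have a function to which I pass different classifiers, their parameters, the training data and its labels: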

from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV

def HT(targets, train_new, algorithm, parameters):
    # create the scorer (f1_score defaults to average='binary')
    scorer = make_scorer(f1_score)
    # create the grid search object with the parameters of the function
    grid_search = GridSearchCV(algorithm, param_grid=parameters,
                               scoring=scorer, cv=5)
    # fit the grid_search object to the data
    grid_search.fit(train_new, targets.ravel())
    # print the name of the classifier, the best score and best parameters
    print(algorithm.__class__.__name__)
    print('Best score: {}'.format(grid_search.best_score_))
    print('Best parameters: {}'.format(grid_search.best_params_))
    # assign the best estimator to the pipeline variable
    pipeline = grid_search.best_estimator_
    # predict the results for the training set
    results = pipeline.predict(train_new).astype(int)
    print(results)
    return pipeline

I pass parameters like the following to this function:

clf_param.append({'C': np.array([0.001, 0.01, 0.1, 1, 10]),
                  'kernel': ['linear', 'rbf'],
                  'decision_function_shape': ['ovr']})

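OK, this is where things start to get strange. The function reports an f1_score, but it differs from the score I calculate manually with the formula: F1 = 2 * (precision * recall) / (precision + recall)

The difference is quite large (0.68 compared with 0.89).

Am I doing something wrong in the function? Should the score calculated by the grid search (grid_search.best_score_) be the same as the score on the whole training set (grid_search.best_estimator_.predict(train_new))? Thanks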

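【Question comments】:

Please show how you calculated the score manually. Is this binary or multi-label classification?
Also change the question title to something more appropriate that relates to the difference in scores. The current title has nothing to do with your actual question.

Tags: python machine-learning scikit-learn cross-validation grid-search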

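【Solution 1】:

The score you calculated manually takes into account the global true positives and negatives across all classes. In scikit-learn, however, f1_score by default computes the binary average (that is, for the positive class only).

So, to get the same score, specify f1_score as follows: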

scorer=make_scorer(f1_score, average='micro')

Or, more simply, in GridSearchCV use:

scoring = 'f1_micro'
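
To make the difference between the two averaging modes concrete, here is a minimal sketch on a tiny hand-made binary example. The label arrays are invented for illustration and are not the asker's data; the calls used (`f1_score`, `make_scorer`) are the ones named in this thread.

```python
from sklearn.metrics import f1_score, make_scorer

# Hypothetical toy labels, chosen only to illustrate the two averaging modes.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 1]

# Default average='binary': only the positive class (label 1) is scored.
# TP=1, FP=1, FN=2 -> precision 1/2, recall 1/3 -> F1 = 0.4
binary_f1 = f1_score(y_true, y_pred)

# average='micro': TP, FP and FN are summed over both classes first.
# Totals: TP=5, FP=3, FN=3 -> precision = recall = 5/8 -> F1 = 0.625
micro_f1 = f1_score(y_true, y_pred, average='micro')

# Scorer object that could be passed as GridSearchCV(..., scoring=micro_scorer)
micro_scorer = make_scorer(f1_score, average='micro')

print(binary_f1, micro_f1)
```

On single-label data such as this, the micro-averaged F1 coincides with plain accuracy (5 of 8 predictions are correct), which is why it can differ noticeably from the positive-class F1.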

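For more information on how score averaging is done, see: http://scikit-learn.org/stable/modules/model_evaluation.html#common-cases-predefined-values

You may also want to look at the following answer, which describes score calculation in scikit in detail: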

https://stackoverflow.com/a/31575870/3374996

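Edit: changed macro to micro. As stated in the documentation:

'micro': Calculate metrics globally by counting the total true positives, false negatives and false positives.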
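
The 'micro' definition quoted above can be spelled out by hand. The following is a rough sketch in plain Python, not scikit-learn's implementation; the helper name `micro_f1` and the sample labels are made up for illustration.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP, FP and FN over all labels, then apply the F1 formula."""
    labels = set(y_true) | set(y_pred)
    tp = fp = fn = 0
    for label in labels:
        for t, p in zip(y_true, y_pred):
            if p == label and t == label:
                tp += 1          # predicted this label, and it was right
            elif p == label and t != label:
                fp += 1          # predicted this label, but it was wrong
            elif p != label and t == label:
                fn += 1          # missed a sample of this label
    if tp == 0:
        return 0.0               # avoid division by zero when nothing is correct
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy labels: pooled counts are TP=5, FP=3, FN=3.
score = micro_f1([1, 1, 1, 0, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0, 0, 1])
print(score)
```

Because every misclassified sample counts once as a false positive (for the predicted label) and once as a false negative (for the true label), the pooled precision and recall are equal here.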

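【Comments】:

Thanks for the answer, Vivek. My problem is a binary classification problem. I know the training data and the labels, and I am applying the formula. Also, after running grid_search, do I need to fit the model again on the whole training set with the best parameters from the grid search in order to make predictions? I assumed that a grid search with cross-validation only returns a classifier fitted on part of the training set.
@Vlad No. The GridSearchCV estimator refits on the whole training data with the best parameters. You can check the documentation. Its constructor actually has a parameter "refit", which is True by default, so it will refit on all the data supplied to it using the best parameters.
Thanks, Vivek. Great help.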
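
The refit behaviour described in the comments can be checked with a small sketch like the one below. The synthetic dataset from `make_classification` and the reduced parameter grid are assumptions made for illustration, not the asker's setup. Note also that `best_score_` is the mean cross-validated score of the best parameter setting, so it would not generally equal a score computed on the full training set with the refitted model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic binary data, used here only as a stand-in for train_new / targets.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Illustrative grid; refit is left at its default (True).
grid_search = GridSearchCV(SVC(),
                           param_grid={'C': [0.1, 1, 10],
                                       'kernel': ['linear', 'rbf']},
                           scoring='f1', cv=5)
grid_search.fit(X, y)

# best_estimator_ has already been refitted on all of X, y with the best
# parameters, so it can be used for prediction directly.
predictions = grid_search.best_estimator_.predict(X)

# Mean F1 over the 5 held-out folds for the best parameter setting.
cv_score = grid_search.best_score_
# F1 of the refitted model on the data it was trained on.
train_score = f1_score(y, predictions)

print('CV score: {:.3f}, training-set score: {:.3f}'.format(cv_score, train_score))
```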