How to determine the best baseline model to perform hyperparameter tuning on in scikit-learn? Answers

【Question title】: How to determine the best baseline model to perform hyperparameter tuning on in scikit learn?
【Posted】: 2021-10-12 12:47:33
【Question】:

I am working with a dataset and trying out different classification algorithms to see which one performs best as a baseline model. The code is as follows:

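# Trying out different classifiers and selecting the best
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# X, y and preprocessor are defined earlier (not shown)

## Create list of classifiers we're going to loop through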
classifiers = [
    KNeighborsClassifier(),
    SVC(),
    DecisionTreeClassifier(),
    RandomForestClassifier(),
    AdaBoostClassifier(),
    GradientBoostingClassifier()
    ]

classifier_names = [
    'kNN',
    'SVC',
    'DecisionTree',
    'RandomForest',
    'AdaBoost',
    'GradientBoosting'
]

model_scores = []

## Looping through the classifiers
for classifier, name in zip(classifiers, classifier_names):
    pipe = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('selector', SelectKBest(k=len(X.columns))),
        ('classifier', classifier)])
    score = cross_val_score(pipe, X, y, cv=5, scoring='accuracy').mean()
    model_scores.append(score)
    print("Model score for {}: {}".format(name, score))

The output is:

Model score for kNN: 0.7472524440239673
Model score for SVC: 0.7896621728161464
Model score for DecisionTree: 0.7302148734267939
Model score for RandomForest: 0.779058799919727
Model score for AdaBoost: 0.7949635904933918
Model score for GradientBoosting: 0.7930712637252372

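So the best model turns out to be AdaBoostClassifier(). I would normally pick the best baseline model and run GridSearchCV on it to further improve its baseline performance.

But what if the model that performs best as a baseline (AdaBoost in this case) improves by only 1% through hyperparameter tuning, while a model that initially performed worse (for example SVC()) has more "potential" to improve through tuning (say, by 4%) and ends up being the better model afterwards?

Is there a way to know this "potential" beforehand, without running a grid search on all of the classifiers?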

【Comments】:

Tags: python scikit-learn hyperparameters

【Answer 1】:

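No, there is no way to know with 100% certainty before hyperparameter tuning which classifier will end up performing best on a given problem. In practice, however, Kaggle competitions on tabular classification problems (as opposed to text- or image-based ones) have shown that gradient-boosted decision tree models such as XGBoost or LightGBM work best in almost all cases. Given this, GradientBoosting is likely to come out ahead after hyperparameter tuning, since it belongs to the same family of models (within scikit-learn, HistGradientBoostingClassifier is the estimator inspired by LightGBM).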

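What your code above does is simply use the default value of every hyperparameter. For algorithms that are more sensitive to hyperparameter tuning, that score is not necessarily indicative of the final (fine-tuned) performance, as you already suggested.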
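One way to act on this, offered as a rough sketch and not a definitive recipe, is to give every candidate the same small randomized-search budget and compare the tuned cross-validated scores instead of the default-parameter scores. The synthetic dataset, the search spaces, the `StandardScaler` step, and the `n_iter=5` budget below are all assumptions chosen for illustration; with real data, `X`, `y`, and the preprocessing pipeline from the question would take their place.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the asker's data (assumption for illustration).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Hypothetical, deliberately small search spaces; adjust to the problem.
candidates = {
    "SVC": (SVC(), {
        "classifier__C": [0.1, 1, 10, 100],
        "classifier__gamma": ["scale", 0.01, 0.1, 1],
    }),
    "AdaBoost": (AdaBoostClassifier(random_state=0), {
        "classifier__n_estimators": [25, 50, 100],
        "classifier__learning_rate": [0.1, 0.5, 1.0],
    }),
    "GradientBoosting": (GradientBoostingClassifier(random_state=0), {
        "classifier__n_estimators": [50, 100],
        "classifier__learning_rate": [0.05, 0.1, 0.2],
        "classifier__max_depth": [2, 3],
    }),
}

tuned_scores = {}
for name, (estimator, param_distributions) in candidates.items():
    pipe = Pipeline(steps=[
        ("scaler", StandardScaler()),
        ("classifier", estimator),
    ])
    # Same budget for every model: 5 sampled settings, 3-fold CV each.
    search = RandomizedSearchCV(
        pipe,
        param_distributions=param_distributions,
        n_iter=5,
        cv=3,
        scoring="accuracy",
        random_state=0,
    )
    search.fit(X, y)
    tuned_scores[name] = search.best_score_
    print("Tuned score for {}: {:.3f} with {}".format(
        name, search.best_score_, search.best_params_))

# Model with the highest tuned cross-validated accuracy.
best_name = max(tuned_scores, key=tuned_scores.get)
print("Best after a small search:", best_name)
```

A budget this small only gives a coarse estimate of each model's headroom, but it costs far less than a full grid search on every classifier, and the most promising one or two models can then be tuned more thoroughly.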

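【Comments】:

【Answer 2】:

Yes. You can run univariate, bivariate, and multivariate analyses to examine the data, and then decide which model to start with as a baseline.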
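As a minimal sketch of what such an exploration might look like, the snippet below computes a univariate score for each feature against the label and a bivariate correlation matrix between features. The synthetic data and the choice of `f_classif` and Pearson correlation are assumptions for illustration; this kind of analysis can hint at which features carry signal, but it does not by itself predict a model's tuned accuracy.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import f_classif

# Synthetic stand-in for real data: 3 informative and 3 noise features.
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0,
    random_state=0,
)

# Univariate: ANOVA F-statistic of each feature against the class label.
f_scores, p_values = f_classif(X, y)

# Bivariate: pairwise Pearson correlation between features.
corr = np.corrcoef(X, rowvar=False)

# Feature indices ordered from highest to lowest F-statistic.
ranking = np.argsort(f_scores)[::-1]
for idx in ranking:
    print("feature {}: F = {:.2f}, p = {:.3g}".format(
        idx, f_scores[idx], p_values[idx]))
```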

You can also use the scikit-learn guide for choosing the right estimator:

https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html

【Comments】: