【Question title】: How to use GridSearchCV with a pipeline and multiple classifiers?
【Posted】: 2022-01-27 04:08:13
【Question body】:

I built a modelPipeline that runs multiple classifiers and returns the pipelines and each classifier's scores as a DataFrame.

How can I use GridSearchCV in the modelPipeline below? Is it possible to use GridSearchCV with multiple classifiers in a Pipeline?

from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from xgboost import XGBClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import pandas as pd
import sklearn.metrics as skm

rs = {'random_state': 42}
# Train-test Split
X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y, 
                                                    test_size = 0.33, 
                                                    random_state = 42)
# Classification - Model Pipeline
def modelPipeline(X_train, X_test, y_train, y_test):

    log_reg = LogisticRegression(**rs)
    nb = BernoulliNB()
    knn = KNeighborsClassifier()
    svm = SVC(**rs)
    mlp = MLPClassifier(max_iter=500, **rs)
    dt = DecisionTreeClassifier(**rs)
    et = ExtraTreesClassifier(**rs)
    rf = RandomForestClassifier(**rs)
    xgb = XGBClassifier(**rs, verbosity=0)

    clfs = [
            ('Logistic Regression', log_reg), 
            ('Naive Bayes', nb),
            ('K-Nearest Neighbors', knn), 
            ('SVM', svm), 
            ('MLP', mlp), 
            ('Decision Tree', dt), 
            ('Extra Trees', et), 
            ('Random Forest', rf), 
            ('XGBoost', xgb)
            ]


    pipelines = []

    scores_df = pd.DataFrame(columns=['Model', 'F1_Score', 'Precision', 'Recall', 'Accuracy', 'ROC_AUC'])


    for clf_name, clf in clfs:

        pipeline = Pipeline(steps=[
                                   ('scaler', StandardScaler()),
                                   ('classifier', clf)
                                   ]
                            )
        pipeline.fit(X_train, y_train)


        y_pred = pipeline.predict(X_test)
        # F1-Score
        fscore = skm.f1_score(y_test, y_pred)
        # Precision
        pres = skm.precision_score(y_test, y_pred)
        # Recall
        rcall = skm.recall_score(y_test, y_pred)
        # Accuracy
        accu = skm.accuracy_score(y_test, y_pred)
        # ROC_AUC
        roc_auc = skm.roc_auc_score(y_test, y_pred)


        pipelines.append(pipeline)

        # DataFrame.append was removed in pandas 2.0; add the row by label instead.
        # Values follow the column order declared for scores_df above.
        scores_df.loc[len(scores_df)] = [clf_name, fscore, pres, rcall, accu, roc_auc]
        
    return pipelines, scores_df

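【Question comments】:

    Tags: python machine-learning scikit-learn gridsearchcv


    【Solution 1】:

    From your comments on my other answer, maybe you just want to tune each model? (In that case you should simplify your example to a single classifier, since the classifiers are tuned independently of one another (?).)

    So, for example: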

        from sklearn.model_selection import GridSearchCV

        log_reg_params = {'C': [0.1, 1, 10]}
        ...
        xgb_params = {
            'learning_rate': [0.05, 0.1, 0.2],
            'max_depth': [1, 2, 3, 5, 8],
            'reg_lambda': [0, 1, 10],
        }
    
        clfs = [
            ('Logistic Regression', log_reg, log_reg_params), 
            ('Naive Bayes', nb, nb_params),
            ...
            ('XGBoost', xgb, xgb_params),
        ]
        for clf_name, clf, param_grid in clfs:
            pipeline = Pipeline(steps=[
                ('scaler', StandardScaler()),
                ('classifier', clf),
            ])
            # Prefix each name with 'classifier__' so it targets the pipeline's final step
            prefixed_grid = {f'classifier__{name}': values for name, values in param_grid.items()}
            search = GridSearchCV(pipeline, prefixed_grid)
            search.fit(X_train, y_train)
            ...
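
The loop above can be made concrete as follows. This is only a minimal sketch under stated assumptions: the synthetic data from make_classification stands in for the asker's X and y, only two of the nine classifiers are shown, the grids are illustrative rather than recommended, and the names rows and best_pipelines are hypothetical helpers that do not appear in the original answer.

```python
import pandas as pd
import sklearn.metrics as skm
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic binary data standing in for the asker's X and y.
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

# (name, estimator, grid) triples; the grids are illustrative only.
clfs = [
    ('Logistic Regression', LogisticRegression(random_state=42),
     {'C': [0.1, 1, 10]}),
    ('Decision Tree', DecisionTreeClassifier(random_state=42),
     {'max_depth': [2, 4, 8]}),
]

rows = []            # one summary row per classifier (hypothetical helper)
best_pipelines = {}  # tuned pipeline per classifier (hypothetical helper)

for clf_name, clf, param_grid in clfs:
    pipeline = Pipeline(steps=[
        ('scaler', StandardScaler()),
        ('classifier', clf),
    ])
    # Route every parameter to the 'classifier' step of the pipeline.
    grid = {f'classifier__{name}': values for name, values in param_grid.items()}

    # Searching over the whole pipeline means the scaler is re-fit on the
    # training part of every cross-validation split.
    search = GridSearchCV(pipeline, grid, scoring='f1', cv=5)
    search.fit(X_train, y_train)

    best_pipelines[clf_name] = search.best_estimator_
    y_pred = search.predict(X_test)  # uses the refit best pipeline
    rows.append({
        'Model': clf_name,
        'Best_Params': search.best_params_,
        'CV_F1': search.best_score_,
        'Test_F1': skm.f1_score(y_test, y_pred),
        'Test_Accuracy': skm.accuracy_score(y_test, y_pred),
    })

scores_df = pd.DataFrame(rows)
print(scores_df)
```

Each classifier gets its own independent search, so the result is one tuned pipeline and one row of scores per model, which matches the shape of what the question's modelPipeline returns.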
    

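    【Comments】:

    • I didn't know we could include ('search', GridSearchCV(clf, param_grid)) in a Pipeline. I'll include all the parameters and test the code. I'll let you know after testing.
    • Actually, it's better to switch that around and search over the pipeline. That way the scaler is fitted separately on the training portion of each fold of the search. (Edited.)
    【Solution 2】:

    GridSearchCV can take a list of classifiers as candidates for the last step of the pipeline. But it won't do exactly what your code does: most notably, the fitted models are not saved by GridSearchCV, only the scores (plus the finally selected model refit on all the data, if refit != False).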

    from sklearn.dummy import DummyClassifier
    from sklearn.model_selection import GridSearchCV

    pipe = Pipeline(steps=[
        ('scaler', StandardScaler()),
        ('classifier', DummyClassifier()),  # doesn't matter, we're going to override this in the search
    ])
    params = {
        'classifier': [log_reg, nb, knn, svm, mlp, dt, et, rf, xgb],
    }
    scoring = ['f1', 'precision', 'recall', 'accuracy', 'roc_auc']
    search = GridSearchCV(pipe, params, scoring=scoring, refit=False)
    

    (With multiple metrics, refit must be set to False, to one of the scoring entries, or to a custom callable.)
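
As a possible extension of this idea, the parameter grid can also be a list of dicts, each pinning 'classifier' to one estimator alongside that estimator's own 'classifier__...' values, so a single search both picks the classifier and tunes it. The sketch below is an illustration under stated assumptions, not the answerer's code: the data is synthetic, only two classifiers are included, the grids are arbitrary, and refit='f1' is one possible choice among the scoring entries.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic binary data standing in for the asker's training set.
X, y = make_classification(n_samples=300, n_features=8, random_state=42)

pipe = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('classifier', DummyClassifier()),  # placeholder, replaced by the grid
])

# One dict per candidate: 'classifier' swaps the final step,
# 'classifier__<param>' tunes that particular estimator.
param_grid = [
    {'classifier': [LogisticRegression(random_state=42)],
     'classifier__C': [0.1, 1, 10]},
    {'classifier': [KNeighborsClassifier()],
     'classifier__n_neighbors': [3, 5, 11]},
]

scoring = ['f1', 'precision', 'recall', 'accuracy', 'roc_auc']

# With several metrics, refit names the one used to select the final model.
search = GridSearchCV(pipe, param_grid, scoring=scoring, refit='f1', cv=5)
search.fit(X, y)

# cv_results_ holds one row per candidate and a mean_test_<metric> column
# for every scorer; only the refit winner is kept as a fitted model.
results = pd.DataFrame(search.cv_results_)
metric_cols = [f'mean_test_{m}' for m in scoring]
summary = results[['params'] + metric_cols].sort_values(
    'mean_test_f1', ascending=False)
print(summary)
print(type(search.best_estimator_.named_steps['classifier']).__name__)
```

The cross-validated scores for every candidate live in cv_results_, while best_estimator_ is the only fitted pipeline that survives the search. If a fitted model per classifier is needed, running a separate search for each classifier is the simpler route.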

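    【Comments】:

    • clfs is the parameter for classifier. Is params a dictionary of classifiers and parameters??
    • I mistook your clfs for just a list of classifiers; I forgot you had put the names in there too. I'll edit. But I don't quite understand your question.
    • Oh. Right now I can apply multiple classifiers in the pipeline with default parameters. Now I want to include an optimization algorithm like GridSearchCV to tune the classifiers and get their best parameters.
    • I don't know how to add GridSearchCV to the pipeline and get the scoring metrics with the best tuned parameters.
    • Oh, maybe I misunderstood your original question...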