【Question title】: Error when trying to run a GridSearchCV on sklearn Pipeline
【Posted】: 2021-06-23 08:19:21
【Question】:

I am trying to run an sklearn pipeline with a TFIDF vectorizer and an XGBoost classifier through GridSearchCV, but it fails with an internal error. The data is 4000 sentences, each labeled true or false (1 or 0). Here is the code:

import numpy as np
import pandas as pd

from gensim import utils
import gensim.parsing.preprocessing as gsp

from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator

from sklearn.feature_extraction.text import TfidfVectorizer

import xgboost as xgb

from sklearn.model_selection import GridSearchCV
from sklearn.metrics import f1_score

train = pd.read_csv("train_data.csv")
test = pd.read_csv("test_data.csv")
train_x = train.iloc[:, 0]
train_y = train.iloc[:, 1]

test_x = test.iloc[:, 0]
test_y = test.iloc[:, 1]

folds = 4

xgb_parameters = {
                'xgboost__n_estimators': [1000, 1500],
                'xgboost__max_depth': [12, 15],
                'xgboost__learning_rate': [0.1, 0.12],
                'xgboost__objective': ['binary:logistic']
}

model = Pipeline(steps=[('tfidf', TfidfVectorizer()),
                         ('xgboost', xgb.XGBClassifier())])

gs_cv = GridSearchCV(estimator=model,
                     param_grid=xgb_parameters,
                     n_jobs=1,
                     refit=True,
                     cv=2,
                     scoring=f1_score)
gs_cv.fit(train_x, train_y)

But I get an error:

>>> gs_cv.fit(train_x, train_y)
C:\Users\draga\miniconda3\lib\site-packages\xgboost\sklearn.py:888: UserWarning: The use of label encoder in XGBClassifier is deprecated and will be removed in a future release. To remove this warning, do the following: 1) Pass option use_label_encoder=False when constructing XGBClassifier object; and 2) Encode your labels (y) as integers starting with 0, i.e. 0, 1, 2, ..., [num_class - 1].
[21:31:18] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.3.0/src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
C:\Users\draga\miniconda3\lib\site-packages\sklearn\utils\validation.py:70: FutureWarning: Pass labels=0       0
1       1
2       1
3       0
4       1
       ..
2004    0
2005    0
2008    0
2009    0
2012    0
Name: Bad Sentence, Length: 2000, dtype: int64 as keyword args. From version 1.0 (renaming of 0.25) passing these as positional arguments will result in an error       
  warnings.warn(f"Pass {args_msg} as keyword args. From version "
C:\Users\draga\miniconda3\lib\site-packages\sklearn\model_selection\_validation.py:683: UserWarning: Scoring failed. The score on this train-test partition for these parameters will be set to nan. Details:
Traceback (most recent call last):
  File "C:\Users\draga\miniconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 674, in _score
    scores = scorer(estimator, X_test, y_test)
  File "C:\Users\draga\miniconda3\lib\site-packages\sklearn\utils\validation.py", line 74, in inner_f
    return f(**kwargs)
  File "C:\Users\draga\miniconda3\lib\site-packages\sklearn\metrics\_classification.py", line 1068, in f1_score
    return fbeta_score(y_true, y_pred, beta=1, labels=labels,
  File "C:\Users\draga\miniconda3\lib\site-packages\sklearn\utils\validation.py", line 63, in inner_f
    return f(*args, **kwargs)
  File "C:\Users\draga\miniconda3\lib\site-packages\sklearn\metrics\_classification.py", line 1192, in fbeta_score
    _, _, f, _ = precision_recall_fscore_support(y_true, y_pred,
  File "C:\Users\draga\miniconda3\lib\site-packages\sklearn\utils\validation.py", line 63, in inner_f
    return f(*args, **kwargs)
  File "C:\Users\draga\miniconda3\lib\site-packages\sklearn\metrics\_classification.py", line 1461, in precision_recall_fscore_support
    labels = _check_set_wise_labels(y_true, y_pred, average, labels,
  File "C:\Users\draga\miniconda3\lib\site-packages\sklearn\metrics\_classification.py", line 1274, in _check_set_wise_labels
    y_type, y_true, y_pred = _check_targets(y_true, y_pred)
  File "C:\Users\draga\miniconda3\lib\site-packages\sklearn\metrics\_classification.py", line 83, in _check_targets
    check_consistent_length(y_true, y_pred)
  File "C:\Users\draga\miniconda3\lib\site-packages\sklearn\utils\validation.py", line 259, in check_consistent_length
    lengths = [_num_samples(X) for X in arrays if X is not None]
  File "C:\Users\draga\miniconda3\lib\site-packages\sklearn\utils\validation.py", line 259, in <listcomp>
    lengths = [_num_samples(X) for X in arrays if X is not None]
  File "C:\Users\draga\miniconda3\lib\site-packages\sklearn\utils\validation.py", line 192, in _num_samples
    raise TypeError(message)
TypeError: Expected sequence or array-like, got <class 'sklearn.pipeline.Pipeline'> 
  1. What could be the problem?

  2. Do I need to include the transform method of TfidfVectorizer() in the pipeline?

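【Comments】:

  • You are fitting the model with only one independent variable, right? I'm really not sure this will fix it, but you could try passing train_x.reshape(-1, 1) to gs_cv.fit()
  • @ArturoSbr, that is usually a requirement, but text transformers such as TfidfVectorizer actually expect one-dimensional input.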
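To illustrate the point about one-dimensional input, here is a minimal sketch. The three sample documents are made up for illustration and are not from the question's data. TfidfVectorizer takes a flat iterable of strings and returns one row per document, so no reshape is needed.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical sample documents: a flat (1-D) list of raw strings.
docs = ["the cat sat", "the dog barked", "a cat and a dog"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)  # sparse matrix, one row per document

print(X.shape[0])                        # number of documents
print(sorted(vectorizer.vocabulary_))    # learned tokens
```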

Tags: python scikit-learn nlp xgboost gridsearchcv


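【Solution 1】:

The main problem is the scoring parameter of your search. Scorers for sklearn's hyperparameter tuners must have the signature (estimator, X, y). A bare metric such as f1_score has the signature (y_true, y_pred), so the pipeline itself ends up in the y_true position, which is why the traceback reports "got <class 'sklearn.pipeline.Pipeline'>". You can use the make_scorer convenience function or, in this case, simply pass the name as a string: scoring="f1".

See the documentation: the list of built-in scorers and the information on scorer signatures.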
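A minimal sketch of both options follows. Two things here are assumptions for illustration: the toy sentences and labels are invented, and LogisticRegression stands in for xgb.XGBClassifier so that the example needs only sklearn. The scoring argument works the same way with any final estimator.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy data: 1 = bad sentence, 0 = fine.
texts = ["bad awful text", "terrible bad words",
         "awful terrible stuff", "bad terrible awful",
         "good nice text", "lovely good words",
         "nice lovely stuff", "good lovely nice"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Stand-in classifier (assumption): replace with xgb.XGBClassifier() as needed.
model = Pipeline(steps=[("tfidf", TfidfVectorizer()),
                        ("clf", LogisticRegression())])
param_grid = {"clf__C": [0.1, 1.0]}

# Option 1: pass the name of a built-in scorer.
gs_by_name = GridSearchCV(model, param_grid, cv=2, scoring="f1")
gs_by_name.fit(texts, labels)

# Option 2: wrap the metric so it gets the (estimator, X, y) signature.
gs_by_scorer = GridSearchCV(model, param_grid, cv=2,
                            scoring=make_scorer(f1_score))
gs_by_scorer.fit(texts, labels)

print(gs_by_name.best_score_, gs_by_scorer.best_score_)
```

make_scorer is mainly useful when the metric needs extra arguments (for example a non-default average); for plain binary F1 the string form is shorter.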

(You do not need to call the transform method explicitly; the pipeline handles that internally.)
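As a rough sketch of what "handled internally" means: calling predict on the pipeline with raw text should give the same result as transforming with the fitted vectorizer and then predicting with the fitted classifier by hand. The sample data and the LogisticRegression stand-in are assumptions for illustration, not part of the question.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy data: 1 = bad sentence, 0 = fine.
texts = ["bad awful text", "terrible bad words",
         "good nice text", "lovely good words"]
labels = [1, 1, 0, 0]

# Stand-in classifier (assumption) in place of xgb.XGBClassifier.
pipe = Pipeline(steps=[("tfidf", TfidfVectorizer()),
                       ("clf", LogisticRegression())])

# fit() runs fit_transform on the vectorizer, then fits the classifier.
pipe.fit(texts, labels)

new_texts = ["awful bad words", "nice good text"]

# predict() runs transform on the vectorizer, then predict on the classifier.
preds = pipe.predict(new_texts)

# The same two steps done manually, for comparison.
features = pipe.named_steps["tfidf"].transform(new_texts)
manual_preds = pipe.named_steps["clf"].predict(features)

print((preds == manual_preds).all())
```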

【Comments】:

  • Correct! All I did was f1 = make_scorer(f1_score). Thanks for scoring="f1", which is the more elegant option.