GridSearchCV.fit() returns TypeError: Expected sequence or array-like, got estimator

【Question title】: GridSearchCV.fit() returns TypeError: Expected sequence or array-like, got estimator
【Posted】: 2017-07-03 20:19:54
【Question】:

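I am trying to do sentiment analysis on Twitter data, following chapter 6 of the book Building Machine Learning Systems in Python.

I am using this dataset: https://raw.githubusercontent.com/zfz/twitter_corpus/master/full-corpus.csv

The estimator is a pipeline of a tf-idf vectorizer and a naive Bayes classifier.

I then use GridSearchCV() to find the best parameters for the estimator.

The code is as follows: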

from load_data import load_data
from sklearn.cross_validation import ShuffleSplit
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import f1_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

def pipeline_tfidf_nb():
    tfidf_vect = TfidfVectorizer( analyzer = "word")
    naive_bayes_clf = MultinomialNB()
    return Pipeline([('vect', tfidf_vect),('nbclf',naive_bayes_clf)])

input_file = "full-corpus.csv"
X,y = load_data(input_file)
print X.shape,y.shape

clf = pipeline_tfidf_nb()
cv = ShuffleSplit(n = len(X), test_size = .3, n_iter = 1, random_state = 0)

clf_param_grid = dict(vect__ngram_range = [(1,1),(1,2),(1,3)],
                   vect__min_df = [1,2],
                    vect__smooth_idf = [False, True],
                    vect__use_idf = [False, True],
                    vect__sublinear_tf = [False, True],
                    vect__binary = [False, True],
                    nbclf__alpha = [0, 0.01, 0.05, 0.1, 0.5, 1],
                  )

grid_search = GridSearchCV(estimator = clf, param_grid = clf_param_grid, cv = cv, scoring = f1_score)
grid_search.fit(X, y)

print grid_search.best_estimator_

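load_data() extracts the rows with positive or negative sentiment from the csv file.

X is an array of strings (TweetText) and y is an array of booleans (True for positive sentiment).

The error is: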

runfile('C:/Users/saurabh.s1/Downloads/Python_ml/ch6/main.py', wdir='C:/Users/saurabh.s1/Downloads/Python_ml/ch6')
Reloaded modules: load_data
negative : 572
positive : 519
(1091,) (1091,)
Traceback (most recent call last):

  File "<ipython-input-25-823b07c4ff26>", line 1, in <module>
    runfile('C:/Users/saurabh.s1/Downloads/Python_ml/ch6/main.py', wdir='C:/Users/saurabh.s1/Downloads/Python_ml/ch6')

  File "C:\anaconda2\lib\site-packages\spyder\utils\site\sitecustomize.py", line 866, in runfile
    execfile(filename, namespace)

  File "C:\anaconda2\lib\site-packages\spyder\utils\site\sitecustomize.py", line 87, in execfile
    exec(compile(scripttext, filename, 'exec'), glob, loc)

  File "C:/Users/saurabh.s1/Downloads/Python_ml/ch6/main.py", line 31, in <module>
    grid_search.fit(X, y)

  File "C:\anaconda2\lib\site-packages\sklearn\grid_search.py", line 804, in fit
    return self._fit(X, y, ParameterGrid(self.param_grid))

  File "C:\anaconda2\lib\site-packages\sklearn\grid_search.py", line 553, in _fit
    for parameters in parameter_iterable

  File "C:\anaconda2\lib\site-packages\sklearn\externals\joblib\parallel.py", line 800, in __call__
    while self.dispatch_one_batch(iterator):

  File "C:\anaconda2\lib\site-packages\sklearn\externals\joblib\parallel.py", line 658, in dispatch_one_batch
    self._dispatch(tasks)

  File "C:\anaconda2\lib\site-packages\sklearn\externals\joblib\parallel.py", line 566, in _dispatch
    job = ImmediateComputeBatch(batch)

  File "C:\anaconda2\lib\site-packages\sklearn\externals\joblib\parallel.py", line 180, in __init__
    self.results = batch()

  File "C:\anaconda2\lib\site-packages\sklearn\externals\joblib\parallel.py", line 72, in __call__
    return [func(*args, **kwargs) for func, args, kwargs in self.items]

  File "C:\anaconda2\lib\site-packages\sklearn\cross_validation.py", line 1550, in _fit_and_score
    test_score = _score(estimator, X_test, y_test, scorer)

  File "C:\anaconda2\lib\site-packages\sklearn\cross_validation.py", line 1606, in _score
    score = scorer(estimator, X_test, y_test)

  File "C:\anaconda2\lib\site-packages\sklearn\metrics\classification.py", line 639, in f1_score
    sample_weight=sample_weight)

  File "C:\anaconda2\lib\site-packages\sklearn\metrics\classification.py", line 756, in fbeta_score
    sample_weight=sample_weight)

  File "C:\anaconda2\lib\site-packages\sklearn\metrics\classification.py", line 956, in precision_recall_fscore_support
    y_type, y_true, y_pred = _check_targets(y_true, y_pred)

  File "C:\anaconda2\lib\site-packages\sklearn\metrics\classification.py", line 72, in _check_targets
    check_consistent_length(y_true, y_pred)

  File "C:\anaconda2\lib\site-packages\sklearn\utils\validation.py", line 173, in check_consistent_length
    uniques = np.unique([_num_samples(X) for X in arrays if X is not None])

  File "C:\anaconda2\lib\site-packages\sklearn\utils\validation.py", line 112, in _num_samples
    'estimator %s' % x)

TypeError: Expected sequence or array-like, got estimator Pipeline(steps=[('vect', TfidfVectorizer(analyzer='word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm=u'l2', preprocessor=None,
        smooth_i...e_idf=False, vocabulary=None)), ('nbclf', MultinomialNB(alpha=0, class_prior=None, fit_prior=True))])

I tried reshaping X and y, but that did not help.

Let me know if you need more data or if I have missed something.

Thanks!

【Comments】:

Tags: python-2.7 machine-learning scikit-learn sentiment-analysis grid-search

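【Solution 1】:

The error occurs because scoring=f1_score passes the wrong kind of argument to the GridSearchCV constructor. See the GridSearchCV documentation.

For the scoring parameter, it asks for:

A string (see the model evaluation documentation) or a scorer callable object / function with signature scorer(estimator, X, y). If None, the score method of the estimator is used.

You are passing a callable with signature (y_true, y_pred[, ...]), which is wrong. GridSearchCV calls the scorer as scorer(estimator, X_test, y_test), as the traceback shows, so f1_score receives the pipeline where it expects y_true, and that is why you get the error. You should either pass the score as a string, as defined in the documentation, or pass a callable with signature (estimator, X, y). The latter can be built with make_scorer (from sklearn.metrics).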
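To make the signature mismatch concrete, here is a minimal pure-Python sketch (Python 3). It is a conceptual illustration only, not scikit-learn's actual implementation: simple_f1, to_scorer and AlwaysPositive are hypothetical names invented for this example, with to_scorer standing in for the adaptation that make_scorer performs.

```python
# Conceptual sketch only: NOT scikit-learn's implementation.
# simple_f1, to_scorer and AlwaysPositive are hypothetical names.


def simple_f1(y_true, y_pred):
    """Toy binary F1 metric with the (y_true, y_pred) signature."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def to_scorer(metric):
    """Adapt a (y_true, y_pred) metric into a (estimator, X, y) scorer."""
    def scorer(estimator, X, y):
        # The scorer, not the caller, is responsible for predicting.
        return metric(y, estimator.predict(X))
    return scorer


class AlwaysPositive:
    """Hypothetical estimator that predicts True for every sample."""
    def predict(self, X):
        return [True for _ in X]


X = ["good", "great", "bad", "fine"]
y = [True, True, False, True]

# Correct: the wrapped metric accepts (estimator, X, y).
score = to_scorer(simple_f1)(AlwaysPositive(), X, y)

# Incorrect: handing the estimator to the raw metric fails, which is
# analogous to passing scoring=f1_score to GridSearchCV.
try:
    simple_f1(AlwaysPositive(), X)
    misuse_failed = False
except TypeError:
    misuse_failed = True
```

The raw metric never sees an estimator; the wrapper calls predict first and only then hands the two label sequences to the metric.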

Change this line in your code:

grid_search = GridSearchCV(estimator = clf, param_grid = clf_param_grid, 
                           cv = cv, scoring = f1_score)

to this:

grid_search = GridSearchCV(estimator = clf, param_grid = clf_param_grid,
                           cv = cv, scoring = 'f1')

              OR

grid_search = GridSearchCV(estimator = clf, param_grid = clf_param_grid,
                           cv = cv, scoring = make_scorer(f1_score))
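
For readers on a newer scikit-learn, the following is a self-contained sketch of both variants. It rests on assumptions that are not in the question: scikit-learn 0.20 or later (where sklearn.cross_validation and sklearn.grid_search were replaced by sklearn.model_selection), Python 3, a tiny made-up corpus instead of full-corpus.csv, integer labels instead of booleans, and a much smaller parameter grid.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV, ShuffleSplit
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for the Twitter data (1 = positive).
X = [
    "i love this phone", "great service and great price",
    "what a wonderful day", "really happy with it", "best purchase ever",
    "i hate this phone", "terrible service and bad price",
    "what an awful day", "really unhappy with it", "worst purchase ever",
]
y = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]


def pipeline_tfidf_nb():
    """Same pipeline shape as in the question."""
    return Pipeline([("vect", TfidfVectorizer(analyzer="word")),
                     ("nbclf", MultinomialNB())])


# Reduced grid for illustration; alpha=0 is left out on purpose.
param_grid = {"vect__ngram_range": [(1, 1), (1, 2)],
              "nbclf__alpha": [0.1, 1.0]}


def run_search(scoring):
    # A fresh splitter with a fixed seed gives both runs the same split.
    cv = ShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
    search = GridSearchCV(estimator=pipeline_tfidf_nb(),
                          param_grid=param_grid, cv=cv, scoring=scoring)
    return search.fit(X, y)


search_str = run_search("f1")                        # scorer given by name
search_callable = run_search(make_scorer(f1_score))  # scorer built explicitly

print(search_str.best_params_, search_str.best_score_)
print(search_callable.best_params_, search_callable.best_score_)
```

Both calls should select the same parameters, since the 'f1' string and make_scorer(f1_score) describe the same metric. make_scorer is mainly useful when the metric needs extra arguments, for example make_scorer(f1_score, average='macro').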

I have answered the same kind of question in another answer.

【Comments】: