ValueError: Found arrays with inconsistent numbers of samples [ 6 1786]

【Question title】: ValueError: Found arrays with inconsistent numbers of samples [ 6 1786]
【Posted】: 2016-05-24 13:58:48
【Question】:

Here is my code:

from sklearn.svm import SVC
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import KFold
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import datasets
import numpy as np

newsgroups = datasets.fetch_20newsgroups(
                subset='all',
                categories=['alt.atheism', 'sci.space']
         )
X = newsgroups.data
y = newsgroups.target

TD_IF = TfidfVectorizer()
y_scaled = TD_IF.fit_transform(newsgroups, y)
grid = {'C': np.power(10.0, np.arange(-5, 6))}
cv = KFold(y_scaled.size, n_folds=5, shuffle=True, random_state=241) 
clf = SVC(kernel='linear', random_state=241)

gs = GridSearchCV(estimator=clf, param_grid=grid, scoring='accuracy', cv=cv)
gs.fit(X, y_scaled)

I get an error and I don't understand why. Traceback:

Traceback (most recent call last):
  File "C:/Users/Roman/PycharmProjects/week_3/assignment_2.py", line 23, in <module>
    gs.fit(X, y_scaled) #TODO: check this line
  File "C:\Users\Roman\AppData\Roaming\Python\Python35\site-packages\sklearn\grid_search.py", line 804, in fit
    return self._fit(X, y, ParameterGrid(self.param_grid))
  File "C:\Users\Roman\AppData\Roaming\Python\Python35\site-packages\sklearn\grid_search.py", line 525, in _fit
    X, y = indexable(X, y)
  File "C:\Users\Roman\AppData\Roaming\Python\Python35\site-packages\sklearn\utils\validation.py", line 201, in indexable
    check_consistent_length(*result)
  File "C:\Users\Roman\AppData\Roaming\Python\Python35\site-packages\sklearn\utils\validation.py", line 176, in check_consistent_length
    "%s" % str(uniques))
ValueError: Found arrays with inconsistent numbers of samples: [ 6 1786]

Can someone explain why this error occurs?

【Question comments】:

Tags: python machine-learning scikit-learn text-analysis

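【Solution 1】:

I think you are mixing up X and y here. You want to transform X (the documents) into tf-idf vectors and train on those against y (the labels). See below: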

from sklearn.svm import SVC
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import KFold
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import datasets
import numpy as np

newsgroups = datasets.fetch_20newsgroups(
                subset='all',
                categories=['alt.atheism', 'sci.space']
         )
X = newsgroups.data
y = newsgroups.target

TD_IF = TfidfVectorizer()
# Vectorize the documents (X), not the whole dataset object; y stays as the target.
X_scaled = TD_IF.fit_transform(X, y)
grid = {'C': np.power(10.0, np.arange(-1, 1))}
# y_scaled is no longer defined, so size the folds from y instead.
cv = KFold(y.size, n_folds=5, shuffle=True, random_state=241)
clf = SVC(kernel='linear', random_state=241)

gs = GridSearchCV(estimator=clf, param_grid=grid, scoring='accuracy', cv=cv)
gs.fit(X_scaled, y)
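
As for where the 6 comes from: it is consistent with the vectorizer iterating over the keys of the dict-like `newsgroups` object instead of over the 1786 documents. That is an inference from the two numbers in the error, not something stated in the question. The sketch below reproduces the mismatch on a small made-up corpus and then applies the fix. The names `docs`, `labels` and `bunch` are hypothetical stand-ins for the real dataset, and the imports use `sklearn.model_selection`, which replaced `sklearn.grid_search` and `sklearn.cross_validation` in later scikit-learn releases.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.svm import SVC

# Hypothetical toy corpus standing in for newsgroups.data / newsgroups.target.
docs = [
    "god religion atheism belief",
    "orbit rocket launch space",
    "atheism argument belief faith",
    "nasa shuttle orbit moon",
    "religion faith god church",
    "space station rocket nasa",
    "belief god argument religion",
    "moon launch shuttle space",
]
labels = np.array([0, 1, 0, 1, 0, 1, 0, 1])

# A dict with 3 keys, mimicking the dict-like object returned by the loader.
bunch = {"data": docs, "target": labels, "DESCR": "toy dataset"}

vectorizer = TfidfVectorizer()

# Mistake from the question: iterating over a dict yields its keys,
# so every key is treated as one "document".
wrong = vectorizer.fit_transform(bunch)
print(wrong.shape[0], len(docs))  # prints: 3 8

# Fix: vectorize the list of documents and keep the labels as y.
X_tfidf = vectorizer.fit_transform(docs)

# In model_selection, KFold takes n_splits and no longer needs the sample count.
cv = KFold(n_splits=4, shuffle=True, random_state=241)
grid = {"C": np.power(10.0, np.arange(-1, 2))}
clf = SVC(kernel="linear", random_state=241)

gs = GridSearchCV(estimator=clf, param_grid=grid, scoring="accuracy", cv=cv)
gs.fit(X_tfidf, labels)
print(gs.best_params_)
```

A quick check such as comparing `X_tfidf.shape[0]` with `len(labels)` before calling `fit` catches this kind of mismatch early.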

【Comments】:

Thank you, that helped a lot! Such a silly mistake =)