SCIKIT Learn multi-class error: answers

【Question title】: SCIKIT Learn multi-class error
【Posted】: 2014-03-15 07:20:00
【Question description】:

I have a script that ran successfully about a year ago but no longer runs. I use pandas to process the data into this form:

df_train

    dtu_docid                                    dtu_topic_split         y_train
0   2012-1553          [Energy Taxation, State & Local Taxation]         [3, 23]
2   2010-0227            [Quantitative Economics and Statistics]            [34]
3   2010-0215                     [International Taxation, Asia]         [0, 19]

Then I use scikit as follows:

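from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# Bag-of-words (1- to 3-grams) -> TF-IDF weighting -> one linear SVM per label
classifier = Pipeline([
    ('vectorizer', CountVectorizer(stop_words='english',
                                   ngram_range=(1, 3),
                                   max_df=1.0,
                                   min_df=0.0,
                                   analyzer='word')),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC(verbose=1)))])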


classifier.fit(df_train.dtu_content, df_train.y_train)

Now I get an error that is driving me crazy:

ValueError: Expected array-like (array or non-string sequence), got 0 [3, 23]
2 [34]
3 [0, 19]
4 []
5 [3]
8 [8, 27]
9 [10]
11 [15]
12 [0, 7]
13 [1, 4]
14 [1, 4, 13] ... (truncated)
15 [11] ... (truncated)

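It looks like the multiclass.py module was enhanced about 9 months ago, which added extra checks, but I don't know how to fix this. Has anyone seen this before or have any ideas?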

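【Comments】:

I was looking into this again this morning and found some cryptic notes on GitHub about a possible fix. It seems a recent version of pandas or scikit broke something very important. IMHO this is a key aspect of using pandas and scikit together: they used to work together in a seamless, simple, and natural way. Is there a known workaround, or an estimate of when the incompatibility will be corrected?
How is df_train constructed? Please post an SSCCE.
df_train was created with a lot of data processing in pandas; the attribute at issue is y_train. y_train is a list of the classes associated with each training example. A list is used because this is a multi-class situation where each sample can fall under more than one class.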
code:

# Build and populate y_train as integers
def get_ytrain(x):
    catlist = []
    for icat in range(len(label)):
        if label[icat] in x:
            catlist.append(icat)
    return catlist

df_train.y_train = df_train.dtu_topic_split.apply(get_ytrain)
df_holdout.y_train = df_holdout.dtu_topic_split.apply(get_ytrain)
print(df_train[['dtu_docid', 'dtu_topic_split', 'y_train', 'predicted']][:20])
print(df_holdout[['dtu_docid', 'dtu_topic_split', 'y_train', 'predicted']][:20])
print(df_train.dtypes)

Tags: python pandas scikit-learn

【Answer 1】:

I ran into this as well. As you noted, multiclass.py has some conservative validation:

# XXX: is there a way to duck-type this condition?
valid = (isinstance(y, (np.ndarray, Sequence, spmatrix))
         and not isinstance(y, string_types))
if not valid:
    raise ValueError('Expected array-like (array or non-string sequence), '
                     'got %r' % y)

Pandas 0.13.0 also changed how Series is implemented:

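Warning

In 0.13.0 Series has internally been refactored to no longer subclass ndarray but instead subclass NDFrame, similar to the rest of the pandas containers. This should be a transparent change with only very limited API implications (see Internal Refactoring).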

The Internal Refactoring notes explain what to do:

Passing a Series directly to a cython function expecting an ndarray type will no longer work directly; you must pass Series.values
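To see why the validation above rejects a Series, here is a minimal sketch that mimics the check. The helper name looks_array_like is hypothetical, the spmatrix case is omitted, and plain str stands in for string_types; it assumes pandas 0.13 or later, where Series no longer subclasses ndarray.

```python
from collections.abc import Sequence

import numpy as np
import pandas as pd


def looks_array_like(y):
    """Simplified stand-in for the multiclass.py check (spmatrix omitted)."""
    return isinstance(y, (np.ndarray, Sequence)) and not isinstance(y, str)


# A small Series of label lists, shaped like df_train.y_train in the question
y_train = pd.Series([[3, 23], [34], [0, 19]])

series_ok = looks_array_like(y_train)          # Series is neither ndarray nor Sequence
values_ok = looks_array_like(y_train.values)   # .values is a plain object ndarray

print(series_ok, values_ok)
```

The Series fails the check while its underlying array passes.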

So in your case, I suggest you try this:

classifier.fit(df_train.dtu_content, df_train.y_train.values)
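As an alternative that is not part of the answer above: newer scikit-learn releases include MultiLabelBinarizer, which turns lists of labels into a binary indicator matrix that OneVsRestClassifier accepts. The sketch below assumes such a release is installed. The toy documents and labels are made up for illustration, and TfidfVectorizer is swapped in for the CountVectorizer plus TfidfTransformer pair to keep it short.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Hypothetical toy data standing in for df_train.dtu_content / df_train.y_train
docs = [
    "energy taxation state local",
    "quantitative economics statistics",
    "international taxation asia",
    "energy taxation policy",
]
label_lists = [[3, 23], [34], [0, 19], [3]]

# One column per distinct label, one row per document
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(label_lists)

classifier = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", OneVsRestClassifier(LinearSVC())),
])
classifier.fit(docs, Y)

# Predictions come back as an indicator matrix; map them back to label tuples
pred = classifier.predict(docs)
decoded = mlb.inverse_transform(pred)
print(mlb.classes_)
print(decoded)
```

With a pandas column, the same idea would be mlb.fit_transform(df_train.y_train), since the binarizer only needs an iterable of label collections.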

【Comments】: