OneVsRest Classifier fails after adding some data

【Question title】: OneVsRest Classifier fails after adding some data
【Posted】: 2015-09-30 11:06:09
【Question description】:

I am trying to get a very simple scikit-learn OneVsRest classifier to work, but I have run into a strange problem.

Here is the code:

import numpy as np
import pandas as pd

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn import preprocessing

input_file = "small.csv"

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.read_csv(input_file, sep=',', quotechar='"', encoding='utf-8')  

codes = df.ix[:,'act_code1':'act_code33']

y = []

for index, row in codes.iterrows():
  row = row[np.logical_not(np.isnan(row))].astype(str)
  row = row.tolist()
  y.append(row)

lb = preprocessing.MultiLabelBinarizer()
Y = lb.fit_transform(y)

classifier = Pipeline([
   ('vectorizer', CountVectorizer()),
   ('tfidf', TfidfTransformer()),
   ('clf', OneVsRestClassifier(LinearSVC()))])

classifier.fit(df['text'], Y)

predicted = classifier.predict(["BASIC SOCIAL SERVICES AID IN ARARATECA VALLEY"])

all_labels = lb.inverse_transform(predicted)

print all_labels

The contents of small.csv are here:

https://drive.google.com/file/d/0Bzt48lX3efsQTnYySFdaTlZhZGc/view?usp=sharing

When I try to classify, I get the following warning, and no classification happens:

UserWarning: indices array has non-integer dtype (float64)
  % self.indices.dtype.name)
[()]

However, if you remove the line (line 6) that starts with:

61821559,LEATHER PROJECT SKILLS TRAININ

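the code works and produces the correct classification output ([('15150.07',)]). You can also "fix" the problem by deleting the last line instead. What is going on?

Edit: just to make sure I have communicated the problem correctly: this is a text-label classification problem, not numeric regression or curve fitting. The "numbers" in the labels are meant to be treated as text strings (which they are). This is a multi-label classification problem.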

【Question discussion】:

Tags: python-2.7 pandas scikit-learn

【Solution 1】:

The problem is in the following part of your code:

y = []

for index, row in codes.iterrows():
  row = row[np.logical_not(np.isnan(row))].astype(str)
  row = row.tolist()
  y.append(row)

print(y)

[['12105.01', '15150.07', '15130.06', '11105.01', '16010.07', '16020.05'], ['99810.01'], ['11430.02', '15140.01'], ['16010.05', '15150.07'], ['32120.08', '32181.01', '16010.01'], ['99810.01'], ['72020.01'], ['72010.01']]

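The numeric values of act_code are not the labels; the column name act_code itself is. By the way, are you doing a classification task? If I understand correctly, given the text input, you are trying to classify it into one or more of act_code 1:33. If your real goal is to predict a numeric value (the output ([('15150.07',)]) in your post really confused me), then you have to reformulate your whole project, because that is a regression problem, not a classification problem.

You should use this instead: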

y = [row.index[row.notnull()].tolist() for _, row in y_codes.iterrows()]

[[u'act_code1', u'act_code2', u'act_code3', u'act_code4', u'act_code5', u'act_code6'], [u'act_code1'], [u'act_code1', u'act_code2'], [u'act_code1', u'act_code2'], [u'act_code1', u'act_code2', u'act_code3'], [u'act_code1'], [u'act_code1'], [u'act_code1']]
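To see what the list comprehension above produces without the original CSV, here is a minimal sketch on a made-up frame. The name `toy_codes` and its values are assumptions for illustration only; they stand in for `y_codes`. Note that the result records which columns are filled in each row, not which code values they hold.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for y_codes: two code columns, NaN where a row has no code.
toy_codes = pd.DataFrame({
    "act_code1": [99810.01, 11430.02],
    "act_code2": [np.nan, 15140.01],
})

# For each row, keep the names of the columns that hold a value.
y = [row.index[row.notnull()].tolist() for _, row in toy_codes.iterrows()]
print(y)
```

Both rows end up with 'act_code1' as a label even though the codes stored in that column differ between them.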

Complete working code:

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn import preprocessing
import pandas as pd

input_file = '/home/Jian/Downloads/small.csv'
df = pd.read_csv(input_file, sep=',', quotechar='"', encoding='utf-8')
y_codes = df.loc[:, 'act_code1':'act_code33']

# process your y-label
# ==============================
y = [row.index[row.notnull()].tolist() for _, row in y_codes.iterrows()]

lb = preprocessing.MultiLabelBinarizer()
Y = lb.fit_transform(y)

print(Y)

# standard text classification with multi-label classes
# ======================================================
# CountVectorizer + TfidfTransformer is equivalent to TfidfVectorizer
classifier = make_pipeline(TfidfVectorizer(), OneVsRestClassifier(LinearSVC()))

X = df.text.values
# this gives a warning message: Label 0 is present in all training examples.
# that is fine here since this is just a very small sample
# in real data, it is unlikely that every observation belongs to class 0
classifier.fit(X, Y)

y_pred = classifier.predict(["BASIC SOCIAL SERVICES AID IN ARARATECA VALLEY"])

all_labels = lb.inverse_transform(y_pred)

print(all_labels)

[(u'act_code1',)]
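The comment in the code above says that CountVectorizer followed by TfidfTransformer is equivalent to TfidfVectorizer. A quick way to check this, sketched here with default settings and a few made-up documents (the `docs` list is an assumption, not data from small.csv):

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

# Made-up documents, used only for this check.
docs = [
    "BASIC SOCIAL SERVICES AID",
    "LEATHER PROJECT SKILLS TRAINING",
    "SOCIAL SERVICES TRAINING PROJECT",
]

# Two-step route: raw term counts, then TF-IDF weighting.
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# One-step route: TfidfVectorizer does both at once.
one_step = TfidfVectorizer().fit_transform(docs)

# Compare the two sparse matrices as dense arrays.
same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)
```

The check only covers default parameters; if you pass options to one route, pass the matching options to the other.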

【Discussion】:

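Thanks Jian, but the values in the act_code fields are the labels; there is no guarantee that the same value will consistently appear in act_code1. act_code1 can be 99810.01 in the first row and then 71109.90 in the next. Is there a way to make this work with the numeric values as the labels? I don't want the classifier to answer act_code1; I want it to answer with the numeric value.
@ScottStewart Classification treats each label as a categorical variable and assumes you cannot compare different labels. For example, apple and orange are two labels, and we cannot say that apple is better than orange. Within the apple label, however, different apples may have different degrees of sweetness, so one apple can be better than another when you compare their numeric sweetness. Similar logic applies to your task: you first need to classify which act_code this particular text belongs to, and then run a regression within that label to predict the value.
@ScottStewart All the code in my post covers the classification part. If you need any numeric prediction, you need to add a further regression part.
My labels look like this: [['12105.01', '15150.07', '15130.06', '11105.01', '16010.07', '16020.05'], ['99810.01'], ['11430.02', '15140.01'], ['16010.05', '15150.07'], ['32120.08', '32181.01', '16010.01'], ['99810.01'], ['72020.01'], ['72010.01']]. Those are the categories, not the columns they appear in. This is not a numeric prediction. These numbers are text labels. If you print y in my example, you will see that they are text labels; they are cast to str.
@ScottStewart I don't think those numeric values are labels... Let's wait for answers from other people.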
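For readers who, like the asker, want the code values themselves to be the labels, the following is a minimal sketch of one way to keep them as text from the start. It assumes a made-up CSV (`csv_text`) in the shape described in the question, not the real small.csv, and it does not claim to explain or remove the warning reported above. Reading with `dtype=str` avoids the float round trip, so a code such as 71109.90 keeps its trailing zero.

```python
import io

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Made-up CSV in the shape described in the question (not the real small.csv).
csv_text = """id,text,act_code1,act_code2
1,WATER SUPPLY PROJECT,99810.01,
2,SKILLS TRAINING,71109.90,99810.01
"""

# dtype=str keeps the codes as text, so '71109.90' is never parsed as a float.
df = pd.read_csv(io.StringIO(csv_text), dtype=str)
codes = df.loc[:, "act_code1":"act_code2"]

# One list of code strings per row; empty cells (NaN) are dropped.
y = [row.dropna().tolist() for _, row in codes.iterrows()]

# Binarize the string labels and map the indicator matrix back again.
lb = MultiLabelBinarizer()
Y = lb.fit_transform(y)

print(lb.classes_)
print(Y)
print(lb.inverse_transform(Y))
```

Here each distinct code string becomes one column of `Y`, regardless of which act_code column it came from.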