Problems classifying labeled text, wrong prediction? Answers

【Question title】: Problems classifying labeled text, wrong prediction?
【Posted】: 2015-02-27 02:52:02
【Question description】:

I am playing with the different classifiers and vectorizers that scikit-learn provides, so let's say I have the following:

training = [["this was a good movie, 'POS'"],
      ["this was a bad movie, 'NEG'"],
      ["i went to the movies, 'NEU'"], 
      ["this movie was very exiting it was great, 'POS'"], 
      ["this is a boring film, 'NEG'"]
        ,........................,
          [" N-sentence, 'LABEL'"]]

# Each element of the list is itself a list that holds one document. Then:

splitted = [#remove the tags from training]

from sklearn.feature_extraction.text import HashingVectorizer
X = HashingVectorizer(
    tokenizer=lambda  doc: doc, lowercase=False).fit_transform(splitted)

print(X.toarray())

Then I get this vector representation:

[[ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]]

The problem is that I don't know whether I vectorized the corpus correctly. Then:

#This is the test corpus:
test = ["I don't like this movie it sucks it doesn't liked me"]

#I vectorize the corpus with hashing vectorizer
Y = HashingVectorizer(
    tokenizer=lambda  doc: doc, lowercase=False).fit_transform(test)

Then I print Y:

[[ 0.  0.  0. ...,  0.  0.  0.]]

Then:

y = [x[-1]for x in training]

#import SVM and classify
from sklearn.svm import SVC
svm = SVC()
svm.fit(X, y)
result = svm.predict(X)
print("\nThe opinion is:\n", result)

And here is the problem: I get the following instead of [NEG], which is actually the correct prediction:

["this was a good movie, 'POS'"]

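I think I am not vectorizing training correctly, or the y target is wrong. Can anyone help me understand what is happening and how I should vectorize training to get a correct prediction?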

【Question discussion】:

Tags: python machine-learning nlp scikit-learn nltk

【Solution 1】:

I will leave it to you to convert your training data into the expected format:

training = ["this was a good movie",
            "this was a bad movie",
            "i went to the movies",
            "this movie was very exiting it was great", 
            "this is a boring film"]

labels = ['POS', 'NEG', 'NEU', 'POS', 'NEG']
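One way to get from the question's nested format to the two lists above is to split each string at its last comma. The sketch below is only an illustration: the helper name split_text_and_label is made up, and it assumes every item is a one-element list whose string ends with ", 'LABEL'". It also shows why x[-1] in the question returns the whole sentence rather than the label.

```python
# The question's format: each document is a one-element list whose single
# string ends with the quoted label, e.g. "some text, 'POS'".
raw = [["this was a good movie, 'POS'"],
       ["this was a bad movie, 'NEG'"],
       ["i went to the movies, 'NEU'"]]


def split_text_and_label(item):
    """Split "text, 'LABEL'" at the last comma (hypothetical helper)."""
    text, label = item[0].rsplit(",", 1)
    return text.strip(), label.strip().strip("'")


pairs = [split_text_and_label(item) for item in raw]
training = [text for text, _ in pairs]
labels = [label for _, label in pairs]

# x[-1] on a one-element list is the whole string, not the label,
# so using it as the target makes every sentence its own class.
whole_string = raw[0][-1]

print(training)
print(labels)
print(whole_string)
```

Splitting at the last comma rather than the first keeps sentences that contain commas intact.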

Feature extraction

>>> from sklearn.feature_extraction.text import HashingVectorizer
>>> vect = HashingVectorizer(n_features=5, stop_words='english', non_negative=True)
>>> X_train = vect.fit_transform(training)
>>> X_train.toarray()
[[ 0.          0.70710678  0.          0.          0.70710678]
 [ 0.70710678  0.70710678  0.          0.          0.        ]
 [ 0.          0.          0.          0.          0.        ]
 [ 0.          0.89442719  0.          0.4472136   0.        ]
 [ 1.          0.          0.          0.          0.        ]]

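For a larger corpus you should increase n_features to avoid collisions; I used 5 so that the resulting matrix can be visualized. Also note that I used stop_words='english'. I think removing stop words is important with so few examples, otherwise you could confuse the classifier.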
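To see why a small n_features causes collisions, here is a toy version of the hashing trick in plain Python. It is a sketch only: the functions bucket and count_collisions are invented for illustration, MD5 stands in for the hash that HashingVectorizer uses internally (so the actual column indices will differ), and the word list was picked by hand from the training sentences.

```python
import hashlib


def bucket(token, n_features):
    """Map a token to a column index with a stable hash (toy stand-in)."""
    digest = hashlib.md5(token.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_features


def count_collisions(tokens, n_features):
    """Return len(tokens) minus the number of distinct columns in use.

    Assumes the tokens are distinct; 0 means every token got its own column.
    """
    used = {bucket(tok, n_features) for tok in tokens}
    return len(tokens) - len(used)


# Distinct content words picked by hand from the five training sentences.
vocab = ["good", "movie", "bad", "went", "movies",
         "exiting", "great", "boring", "film"]

for n in (5, 2 ** 20):
    print(n, count_collisions(vocab, n))
```

With nine words and only five columns, at least four words must share a column with another word, whatever hash is used. With around a million columns, collisions become unlikely for a vocabulary this small.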

Model training

from sklearn.svm import SVC

model = SVC()
model.fit(X_train, labels)

Prediction

>>> test = ["I don't like this movie it sucks it doesn't liked me"]
>>> X_pred = vect.transform(test)
>>> model.predict(X_pred)
['NEG']

>>> test = ["I think it was a good movie"]
>>> X_pred = vect.transform(test)
>>> model.predict(X_pred)
['POS']

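Edit: Note that the correct classification of the first test example is just a lucky coincidence, since I don't see any word in it that could have been learned as negative from the training set. In the second example, the word good probably triggered the positive classification.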

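【Comments】:

Thank you very much for helping me understand. What I would like clarified is the labeling. Say I have, for example, 1000000 training sentences. If I label those 1000000 sentences by hand and put the label at the end of each sentence, how should I label the training sentences and present them to the classifier?
If you mean in a text file, you can pick a separator such as | and then it is easy to read the two parts separately with pandas or the csv module.
What I mean is something like this: training = [["first sentence, 'LABEL'"], ["second sentence, 'LABEL'"]... ["N opinion, 'LABEL'"]]. I want to receive all the documents in one list, one document per inner list, and then vectorize them.
Converting from your format to the one I provided is not hard. You can ask for help here. May I ask why your data is in such a strange format? What is your data source? It looks like there is a design problem somewhere.
Yes, CSV + pandas works for small data. If you use a .csv file, a layout like SPAM|this is spam is better. It is easier to parse unambiguously, and using a comma as the separator can get you into trouble when dealing with natural text. Other options are a sqlite database, or even pickling two lists ([sent1, sent2, ...] and [label1, label2, ...]) or a list of tuples ([(sent1, label1), (sent2, label2), ...]) to disk.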
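The pipe-separated layout suggested in the comments could be read with the standard csv module roughly as follows. This is a sketch under assumptions: the sample rows are made up, an in-memory io.StringIO stands in for a real file, and the label is assumed to come first on each line.

```python
import csv
import io
import pickle

# Stand-in for a file on disk; the rows are invented for illustration.
data = io.StringIO(
    "POS|this was a good movie\n"
    "NEG|this was a bad movie, really\n"
)

labels, sentences = [], []
for label, sentence in csv.reader(data, delimiter="|"):
    labels.append(label)
    sentences.append(sentence)

# The list-of-tuples alternative, round-tripped through pickle in memory.
blob = pickle.dumps(list(zip(sentences, labels)))
restored = pickle.loads(blob)

print(labels)
print(sentences)
print(restored)
```

Because the separator is | rather than a comma, the comma inside the second sentence stays part of the text.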