【Question title】: Problems classifying labeled text, wrong prediction?
【Posted】: 2015-02-27 02:52:02
【Question】:

I am working with the different classifiers and vectorizers that scikit-learn provides, so let's say I have the following:

training = [["this was a good movie, 'POS'"],
      ["this was a bad movie, 'NEG'"],
      ["i went to the movies, 'NEU'"], 
      ["this movie was very exiting it was great, 'POS'"], 
      ["this is a boring film, 'NEG'"]
        ,........................,
          [" N-sentence, 'LABEL'"]]

# Each element of the list is itself a list that holds one document.

splitted = [#remove the tags from training]

from sklearn.feature_extraction.text import HashingVectorizer
X = HashingVectorizer(
    tokenizer=lambda  doc: doc, lowercase=False).fit_transform(splitted)

print X.toarray()

Then I get this vector representation:

[[ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]]

The problem is that I don't know whether I vectorized the corpus correctly. Then:

#This is the test corpus:
test = ["I don't like this movie it sucks it doesn't liked me"]

#I vectorize the corpus with hashing vectorizer
Y = HashingVectorizer(
    tokenizer=lambda  doc: doc, lowercase=False).fit_transform(test)

Then I print Y:

[[ 0.  0.  0. ...,  0.  0.  0.]]

Then:

y = [x[-1]for x in training]

#import SVM and classify
from sklearn.svm import SVC
svm = SVC()
svm.fit(X, y)
result = svm.predict(X)
print "\nThe opinion is:\n",result

Here is the problem: I get [NEG] for the following input, whose correct label is actually POS:

["this was a good movie, 'POS'"]

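I think I am either not vectorizing training correctly or building the y target wrongly. Can anyone help me understand what is happening and how I should vectorize training to get correct predictions?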

【Question comments】:

    Tags: python machine-learning nlp scikit-learn nltk


    【Solution 1】:

    I'll leave it to you to convert the training data into the expected format:

    training = ["this was a good movie",
                "this was a bad movie",
                "i went to the movies",
                "this movie was very exiting it was great", 
                "this is a boring film"]
    
    labels = ['POS', 'NEG', 'NEU', 'POS', 'NEG']
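
One way the conversion might look, as a minimal sketch: it assumes every document is a one-element list whose string ends with `, 'LABEL'`, as in the question. The helper name `split_labeled` is made up for illustration.

```python
def split_labeled(doc):
    """Split "sentence, 'LABEL'" into (sentence, label).

    Hypothetical helper: assumes the label follows the LAST comma,
    so commas inside the sentence itself are preserved.
    """
    sentence, label = doc.rsplit(",", 1)
    return sentence.strip(), label.strip().strip("'")


# Data in the question's format: one single-string list per document.
raw = [["this was a good movie, 'POS'"],
       ["this was a bad movie, 'NEG'"],
       ["i went to the movies, 'NEU'"]]

pairs = [split_labeled(doc[0]) for doc in raw]
training = [sentence for sentence, _ in pairs]
labels = [label for _, label in pairs]

print(training)
print(labels)
```

Splitting on the last comma only is a deliberate choice here, since natural text often contains commas of its own.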
    

    Feature extraction

    >>> from sklearn.feature_extraction.text import HashingVectorizer
    >>> vect = HashingVectorizer(n_features=5, stop_words='english', non_negative=True)
    >>> X_train = vect.fit_transform(training)
    >>> X_train.toarray()
    [[ 0.          0.70710678  0.          0.          0.70710678]
     [ 0.70710678  0.70710678  0.          0.          0.        ]
     [ 0.          0.          0.          0.          0.        ]
     [ 0.          0.89442719  0.          0.4472136   0.        ]
     [ 1.          0.          0.          0.          0.        ]]
    

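    For a larger corpus you should increase n_features to avoid collisions; I used 5 only so that the resulting matrix can be visualized. Also note that I used stop_words='english'. With so few examples I think it is important to get rid of stop words, otherwise you may confuse the classifier.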
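
To illustrate why a small n_features causes collisions, here is a toy version of the hashing trick. It is only a sketch: it uses `hashlib.md5` as a stand-in hash and plain counts, whereas scikit-learn's own implementation differs (to my knowledge it uses a signed MurmurHash3 and normalizes the rows). The function names are invented for this example.

```python
import hashlib


def bucket(token, n_features):
    """Map a token to a column index with a stable hash (md5 as a stand-in)."""
    digest = hashlib.md5(token.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_features


def toy_hash_vector(tokens, n_features):
    """Count how many tokens land in each of the n_features columns."""
    vec = [0] * n_features
    for tok in tokens:
        vec[bucket(tok, n_features)] += 1
    return vec


def count_collisions(vocab, n_features):
    """Number of distinct tokens that share a column with another token."""
    used = {bucket(tok, n_features) for tok in vocab}
    return len(vocab) - len(used)


vocab = ["good", "bad", "movie", "movies", "went",
         "exiting", "great", "boring", "film"]

# 9 distinct tokens cannot fit into 5 columns without sharing:
# at least 4 of them must collide.
print(count_collisions(vocab, 5))
# With many more columns, collisions become far less likely.
print(count_collisions(vocab, 2 ** 20))
print(toy_hash_vector(["good", "movie"], 5))
```

Once two words share a column the classifier can no longer tell them apart, which is why a larger n_features matters on a real corpus.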

    Model training

    from sklearn.svm import SVC
    
    model = SVC()
    model.fit(X_train, labels)
    

    Prediction

    >>> test = ["I don't like this movie it sucks it doesn't liked me"]
    >>> X_pred = vect.transform(test)
    >>> model.predict(X_pred)
    ['NEG']
    
    >>> test = ["I think it was a good movie"]
    >>> X_pred = vect.transform(test)
    >>> model.predict(X_pred)
    ['POS']
    

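    Edit: note that the correct classification of the first test example is just a lucky coincidence, since I don't see any word in it that could have been learned as negative from the training set. In the second example, the word good probably triggered the positive classification.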
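
For readers on a recent scikit-learn release, the steps above might be combined as in the sketch below. Two things here are assumptions that are not part of the answer: the `non_negative` argument was removed in scikit-learn 0.21, so `alternate_sign=False` is used as the closest replacement (the feature values can differ from the matrix shown earlier), and `n_features=2**10` is an arbitrary choice. Wrapping both steps in a pipeline is simply one way to guarantee that the same vectorizer settings are applied to training and test text.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

training = ["this was a good movie",
            "this was a bad movie",
            "i went to the movies",
            "this movie was very exiting it was great",
            "this is a boring film"]
labels = ["POS", "NEG", "NEU", "POS", "NEG"]

# alternate_sign=False keeps all feature values non-negative
# (assumed stand-in for the removed non_negative=True).
model = make_pipeline(
    HashingVectorizer(n_features=2 ** 10, stop_words="english",
                      alternate_sign=False),
    SVC(),
)
model.fit(training, labels)

# The pipeline vectorizes the raw test text with the same settings.
test = ["I think it was a good movie"]
prediction = model.predict(test)
print(prediction)
```

With only five training sentences the predicted label should not be trusted; the point is the workflow, not the result.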

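    【Discussion】:

    • Thanks a lot for helping me understand. What I would like to clarify is the labeling. Say I have 1000000 training sentences: if I label those 1000000 sentences by hand and put each label at the end of its sentence, how do I present the labeled training sentences to the classifier?
    • If you mean in a text file, you can pick a delimiter such as | and then it is easy to read the two parts separately with pandas or the csv module.
    • I mean something like this: training = [["first sentence, 'LABEL'"], ["second sentence, 'LABEL'"]... ["N opinion, 'LABEL'"]]. I want to receive all the documents in one list, one document per inner list, and then vectorize them.
    • Converting from your format to the one I provided is not hard. You can ask for help here. May I ask why your data is in such a strange format? What is your data source? It looks like there is a design problem somewhere.
    • Yes, CSV + pandas works for small data. If you use a .csv, a layout like SPAM|this is spam is better. It is easier to parse unambiguously; using a comma as the delimiter can get you into trouble when dealing with natural text. Other options are a sqlite database, or even pickling two lists ([sent1, sent2, ...][label1, label2, ...]) or a list of tuples ([(sent1, label1), (sent2, label2), ...]) to disk.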
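
The `LABEL|sentence` layout suggested in the comments could be read with the standard csv module roughly as follows. This is a sketch: the in-memory `io.StringIO` stands in for a real file, the sample rows are made up, and it assumes the text itself never contains a `|`.

```python
import csv
import io

# Stand-in for a file on disk; the rows are invented for illustration.
data = io.StringIO(
    "POS|this was a good movie\n"
    "NEG|this was a bad movie\n"
    "NEU|i went to the movies, then home\n"
)

labels, training = [], []
# QUOTE_NONE treats quote characters in the text as ordinary characters.
reader = csv.reader(data, delimiter="|", quoting=csv.QUOTE_NONE)
for label, sentence in reader:
    labels.append(label)
    training.append(sentence)

print(labels)
print(training)
```

Because the delimiter is `|`, the comma inside the third sentence stays part of the text.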