【Posted】: 2015-02-27 02:52:02
【Question】:
I am working with the different classifiers and vectorizers provided by scikit-learn, so suppose I have the following:
training = [["this was a good movie, 'POS'"],
["this was a bad movie, 'NEG'"],
["i went to the movies, 'NEU'"],
["this movie was very exiting it was great, 'POS'"],
["this is a boring film, 'NEG'"]
,........................,
[" N-sentence, 'LABEL'"]]
# Each element of the list is itself a list that holds one document.
splitted = [#remove the tags from training]
from sklearn.feature_extraction.text import HashingVectorizer
X = HashingVectorizer(
    tokenizer=lambda doc: doc, lowercase=False).fit_transform(splitted)
print X.toarray()
Then I get this vector representation:
[[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]]
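The `splitted` placeholder above is left unfilled. As a minimal sketch, assuming each entry really is a one-element list holding a single string of the form `sentence, 'LABEL'` (as written in the listing), the text and the tag could be separated as follows. The helper name `split_example` and the variables `texts` and `labels` are hypothetical and not part of the original code.

```python
# Assumed layout, copied from the listing: each entry is a one-element list
# whose single string holds both the sentence and its quoted tag.
training = [["this was a good movie, 'POS'"],
            ["this was a bad movie, 'NEG'"],
            ["i went to the movies, 'NEU'"],
            ["this movie was very exiting it was great, 'POS'"],
            ["this is a boring film, 'NEG'"]]


def split_example(entry):
    """Split one entry into (sentence, label) at the last comma."""
    sentence, _, tag = entry[0].rpartition(",")
    return sentence.strip(), tag.strip().strip("'")


# Two parallel tuples: the raw sentences and their short tags.
texts, labels = zip(*(split_example(entry) for entry in training))

print(texts[0])         # the sentence without its tag
print(labels)           # one short tag per document
print(training[0][-1])  # the last element of an entry is the whole string
```

Under this layout, indexing an entry with `[-1]` returns the whole string rather than the tag, because each inner list has only one element.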
The problem is that I don't know whether I vectorized the corpus correctly. Then:
#This is the test corpus:
test = ["I don't like this movie it sucks it doesn't liked me"]
#I vectorize the corpus with hashing vectorizer
Y = HashingVectorizer(
    tokenizer=lambda doc: doc, lowercase=False).fit_transform(test)
Then I print Y:
[[ 0. 0. 0. ..., 0. 0. 0.]]
Then:
y = [x[-1] for x in training]
#import SVM and classify
from sklearn.svm import SVC
svm = SVC()
svm.fit(X, y)
result = svm.predict(X)
print "\nThe opinion is:\n",result
And here is the problem: instead of [NEG], which would be the correct prediction, I get the following:
["this was a good movie, 'POS'"]
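I think I am either not vectorizing training correctly or the y target is wrong. Can anyone help me understand what is happening and how I should vectorize training to get correct predictions?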
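For comparison, here is a minimal end-to-end sketch of one common arrangement, offered under assumptions rather than as a definitive fix. The sentences and tags are kept in separate lists (`train_texts` and `train_labels`, hypothetical names), a single `HashingVectorizer` with its default tokenizer is reused for both corpora, and prediction runs on the test vectors instead of the training vectors. `LinearSVC` and `n_features=2**10` are choices made only for this illustration, and with five training sentences the predicted label should not be read as meaningful.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.svm import LinearSVC

# Hypothetical, already-separated version of the training data.
train_texts = ["this was a good movie",
               "this was a bad movie",
               "i went to the movies",
               "this movie was very exiting it was great",
               "this is a boring film"]
train_labels = ["POS", "NEG", "NEU", "POS", "NEG"]

test = ["I don't like this movie it sucks it doesn't liked me"]

# HashingVectorizer is stateless, so transform() needs no prior fit; reusing
# one instance keeps both corpora in the same feature space.
vectorizer = HashingVectorizer(n_features=2**10)
X_train = vectorizer.transform(train_texts)
X_test = vectorizer.transform(test)

# Fit on the training vectors with one short label per document.
clf = LinearSVC()
clf.fit(X_train, train_labels)

# Predict on the test vectors, not on the training matrix.
result = clf.predict(X_test)
print("The opinion is:", result)
```

The point of the sketch is the shape of the workflow: one label per document in the target list, one shared vectorizer, and `predict` called on the unseen text.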
【Comments】:
Tags: python machine-learning nlp scikit-learn nltk