【发布时间】:2015-07-13 02:13:32
【问题描述】:
我正在尝试从textual corpus 获取信息量最大的功能。从这个很好的回答question我知道这个任务可以完成如下:
def most_informative_feature_for_class(vectorizer, classifier, classlabel, n=10):
labelid = list(classifier.classes_).index(classlabel)
feature_names = vectorizer.get_feature_names()
topn = sorted(zip(classifier.coef_[labelid], feature_names))[-n:]
for coef, feat in topn:
print classlabel, feat, coef
然后:
most_informative_feature_for_class(tfidf_vect, clf, 5)
对于这个分类器:
X = tfidf_vect.fit_transform(df['content'].values)
y = df['label'].values
from sklearn import cross_validation
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X,
y, test_size=0.33)
clf = SVC(kernel='linear', C=1)
clf.fit(X, y)
prediction = clf.predict(X_test)
问题是most_informative_feature_for_class的输出:
5 a_base_de_bien bastante (0, 2451) -0.210683496368
(0, 3533) -0.173621065386
(0, 8034) -0.135543062425
(0, 10346) -0.173621065386
(0, 15231) -0.154148294738
(0, 18261) -0.158890483047
(0, 21083) -0.297476572586
(0, 434) -0.0596263855375
(0, 446) -0.0753492277856
(0, 769) -0.0753492277856
(0, 1118) -0.0753492277856
(0, 1439) -0.0753492277856
(0, 1605) -0.0753492277856
(0, 1755) -0.0637950312345
(0, 3504) -0.0753492277856
(0, 3511) -0.115802483001
(0, 4382) -0.0668983049212
(0, 5247) -0.315713152154
(0, 5396) -0.0753492277856
(0, 5753) -0.0716096348446
(0, 6507) -0.130661516772
(0, 7978) -0.0753492277856
(0, 8296) -0.144739048504
(0, 8740) -0.0753492277856
(0, 8906) -0.0753492277856
: :
(0, 23282) 0.418623443832
(0, 4100) 0.385906085143
(0, 15735) 0.207958503155
(0, 16620) 0.385906085143
(0, 19974) 0.0936828782325
(0, 20304) 0.385906085143
(0, 21721) 0.385906085143
(0, 22308) 0.301270427482
(0, 14903) 0.314164150621
(0, 16904) 0.0653764031957
(0, 20805) 0.0597723455204
(0, 21878) 0.403750815828
(0, 22582) 0.0226150073272
(0, 6532) 0.525138162099
(0, 6670) 0.525138162099
(0, 10341) 0.525138162099
(0, 13627) 0.278332617058
(0, 1600) 0.326774799211
(0, 2074) 0.310556919237
(0, 5262) 0.176400451433
(0, 6373) 0.290124806858
(0, 8593) 0.290124806858
(0, 12002) 0.282832270298
(0, 15008) 0.290124806858
(0, 19207) 0.326774799211
它既不返回标签也不返回单词。为什么会发生这种情况,如何打印文字和标签?自从我使用熊猫读取数据以来,你们是否正在发生这种情况?我尝试的另一件事如下,形成这个question:
def print_top10(vectorizer, clf, class_labels):
"""Prints features with the highest coefficient values, per class"""
feature_names = vectorizer.get_feature_names()
for i, class_label in enumerate(class_labels):
top10 = np.argsort(clf.coef_[i])[-10:]
print("%s: %s" % (class_label,
" ".join(feature_names[j] for j in top10)))
print_top10(tfidf_vect,clf,y)
但我得到了这个回溯:
Traceback(最近一次调用最后一次):
File "/Users/user/PycharmProjects/TESIS_FINAL/Classification/Supervised_learning/Final/experimentos/RBF/SVM_con_rbf.py", line 237, in <module>
print_top10(tfidf_vect,clf,5)
File "/Users/user/PycharmProjects/TESIS_FINAL/Classification/Supervised_learning/Final/experimentos/RBF/SVM_con_rbf.py", line 231, in print_top10
for i, class_label in enumerate(class_labels):
TypeError: 'int' object is not iterable
知道如何解决这个问题,以获得具有最高系数值的特征吗?
【问题讨论】:
标签: python pandas machine-learning nlp scikit-learn