Problems obtaining most informative features with scikit learn?

【Question title】: Problems obtaining most informative features with scikit learn?
【Posted】: 2015-07-13 02:13:32
【Question description】:

I am trying to obtain the most informative features from a textual corpus. From this well-answered question I know that the task can be done as follows:

def most_informative_feature_for_class(vectorizer, classifier, classlabel, n=10):
    labelid = list(classifier.classes_).index(classlabel)
    feature_names = vectorizer.get_feature_names()
    topn = sorted(zip(classifier.coef_[labelid], feature_names))[-n:]

    for coef, feat in topn:
        print(classlabel, feat, coef)

Then:

most_informative_feature_for_class(tfidf_vect, clf, 5)

For this classifier:

X = tfidf_vect.fit_transform(df['content'].values)
y = df['label'].values


from sklearn import cross_validation
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X,
                                                    y, test_size=0.33)
clf = SVC(kernel='linear', C=1)
clf.fit(X, y)
prediction = clf.predict(X_test)

The problem is the output of most_informative_feature_for_class:

5 a_base_de_bien bastante   (0, 2451)   -0.210683496368
  (0, 3533) -0.173621065386
  (0, 8034) -0.135543062425
  (0, 10346)    -0.173621065386
  (0, 15231)    -0.154148294738
  (0, 18261)    -0.158890483047
  (0, 21083)    -0.297476572586
  (0, 434)  -0.0596263855375
  (0, 446)  -0.0753492277856
  (0, 769)  -0.0753492277856
  (0, 1118) -0.0753492277856
  (0, 1439) -0.0753492277856
  (0, 1605) -0.0753492277856
  (0, 1755) -0.0637950312345
  (0, 3504) -0.0753492277856
  (0, 3511) -0.115802483001
  (0, 4382) -0.0668983049212
  (0, 5247) -0.315713152154
  (0, 5396) -0.0753492277856
  (0, 5753) -0.0716096348446
  (0, 6507) -0.130661516772
  (0, 7978) -0.0753492277856
  (0, 8296) -0.144739048504
  (0, 8740) -0.0753492277856
  (0, 8906) -0.0753492277856
  : :
  (0, 23282)    0.418623443832
  (0, 4100) 0.385906085143
  (0, 15735)    0.207958503155
  (0, 16620)    0.385906085143
  (0, 19974)    0.0936828782325
  (0, 20304)    0.385906085143
  (0, 21721)    0.385906085143
  (0, 22308)    0.301270427482
  (0, 14903)    0.314164150621
  (0, 16904)    0.0653764031957
  (0, 20805)    0.0597723455204
  (0, 21878)    0.403750815828
  (0, 22582)    0.0226150073272
  (0, 6532) 0.525138162099
  (0, 6670) 0.525138162099
  (0, 10341)    0.525138162099
  (0, 13627)    0.278332617058
  (0, 1600) 0.326774799211
  (0, 2074) 0.310556919237
  (0, 5262) 0.176400451433
  (0, 6373) 0.290124806858
  (0, 8593) 0.290124806858
  (0, 12002)    0.282832270298
  (0, 15008)    0.290124806858
  (0, 19207)    0.326774799211

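It returns neither the labels nor the words. Why is this happening, and how can I print the words and the labels? Is it happening because I am using pandas to read the data? Another thing I tried is the following, taken from this question: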

def print_top10(vectorizer, clf, class_labels):
    """Prints features with the highest coefficient values, per class"""
    feature_names = vectorizer.get_feature_names()
    for i, class_label in enumerate(class_labels):
        top10 = np.argsort(clf.coef_[i])[-10:]
        print("%s: %s" % (class_label,
              " ".join(feature_names[j] for j in top10)))


print_top10(tfidf_vect,clf,y)

But I get this traceback:

Traceback (most recent call last):

  File "/Users/user/PycharmProjects/TESIS_FINAL/Classification/Supervised_learning/Final/experimentos/RBF/SVM_con_rbf.py", line 237, in <module>
    print_top10(tfidf_vect,clf,5)
  File "/Users/user/PycharmProjects/TESIS_FINAL/Classification/Supervised_learning/Final/experimentos/RBF/SVM_con_rbf.py", line 231, in print_top10
    for i, class_label in enumerate(class_labels):
TypeError: 'int' object is not iterable

Any idea how to solve this, so that I get the features with the highest coefficient values?

【Question discussion】:

Tags: python pandas machine-learning nlp scikit-learn

【Solution 1】:

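To solve this specifically for a linear SVM, we first have to understand how the SVM is formulated in sklearn and how it differs from MultinomialNB.

The reason most_informative_feature_for_class works for MultinomialNB is that the output of coef_ is essentially the log probability of the features given a class (and hence has size [n_classes, n_features]), which follows from the formulation of the naive Bayes problem. But if we check the documentation for SVM, coef_ is not that simple. Instead, coef_ for a (linear) SVM has size [n_classes * (n_classes - 1) / 2, n_features], because one binary model is fitted for every possible pair of classes (one-vs-one).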
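To make the n_classes * (n_classes - 1) / 2 count concrete, here is a small sketch that enumerates the one-vs-one pairs for the four language tags used in this answer. The helper name `ovo_pairs` is made up for illustration. The row order it prints ("0 vs 1", "0 vs 2", ..., taken over the sorted class labels) is an assumption based on how the scikit-learn documentation describes one-vs-one ordering, so check it against your installed version before relying on a particular row index.

```python
from itertools import combinations


def ovo_pairs(classes):
    """Return every unordered pair of class labels, in sorted-label order.

    Hypothetical helper: it only mirrors the assumed row layout of
    SVC(kernel='linear').coef_, it does not inspect a fitted model.
    """
    return list(combinations(sorted(classes), 2))


pairs = ovo_pairs(['bs', 'pt', 'es', 'sr'])
for row, (first, second) in enumerate(pairs):
    # Each row would correspond to one binary "first vs second" model.
    print(row, first, 'vs', second)
```

With four classes this yields six pairs, so a single row of coef_ describes one pairwise model rather than one class.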

If we do know which particular coefficient vector we are interested in, we can alter the function as follows:

def most_informative_feature_for_class_svm(vectorizer, classifier, labelid, n=10):
    # labelid is the row of coef_ we're interested in; the caller has to choose it.
    feature_names = vectorizer.get_feature_names()
    # coef_ is a sparse matrix when the SVC was fit on sparse input, hence toarray().
    svm_coef = classifier.coef_.toarray()
    topn = sorted(zip(svm_coef[labelid], feature_names))[-n:]

    for coef, feat in topn:
        print(feat, coef)

This will work as intended and print the top n features, with their coefficients, for the coefficient vector you're after.
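The ranking step itself is just a sort over (coefficient, feature) pairs. The sketch below uses made-up coefficients and words, not values from any fitted model, to show what the [-n:] slice selects. It also returns the [:n] slice, since in a binary (pairwise) linear model the most negative weights push toward the other class of the pair and can be just as informative.

```python
def top_and_bottom(coefs, feature_names, n=2):
    """Return the n largest and the n smallest (coef, feature) pairs."""
    ranked = sorted(zip(coefs, feature_names))
    return ranked[-n:], ranked[:n]


# Toy numbers for illustration only.
coefs = [0.02, -0.31, 0.04, -0.07, 0.0]
names = ['por', 'teve', 'lo', 'foi', 'casa']

top, bottom = top_and_bottom(coefs, names, n=2)
print(top)     # largest coefficients, ascending
print(bottom)  # most negative coefficients, ascending
```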

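As for getting the correct output for a particular class, that depends on your assumptions and on what you want the output to be. I suggest reading the multi-class section of the SVM documentation to get a feel for what you're after.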

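So, using the train.txt file described in this question, we can get some kind of output, although in this case it isn't particularly descriptive or helpful to interpret. Hope this helps.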

import codecs

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

# Note: this targets the scikit-learn API of the time. Newer releases replaced
# get_feature_names() with get_feature_names_out(), and deprecated (0.24) and
# later removed MultinomialNB.coef_ in favour of feature_log_prob_.

trainfile = 'train.txt'

# Vectorizing data: each line of train.txt is treated as one document.
word_vectorizer = CountVectorizer(analyzer='word')
trainset = word_vectorizer.fit_transform(codecs.open(trainfile, 'r', 'utf8'))
tags = ['bs', 'pt', 'es', 'sr']

# Training NB
mnb = MultinomialNB()
mnb.fit(trainset, tags)

# Training a linear SVM (one-vs-one for multi-class)
svcc = SVC(kernel='linear', C=1)
svcc.fit(trainset, tags)

def most_informative_feature_for_class(vectorizer, classifier, classlabel, n=10):
    labelid = list(classifier.classes_).index(classlabel)
    feature_names = vectorizer.get_feature_names()
    topn = sorted(zip(classifier.coef_[labelid], feature_names))[-n:]

    for coef, feat in topn:
        print(classlabel, feat, coef)

def most_informative_feature_for_class_svm(vectorizer, classifier, n=10):
    labelid = 3  # this is the coef we're interested in.
    feature_names = vectorizer.get_feature_names()
    # coef_ is sparse because the SVC was fit on a sparse matrix.
    svm_coef = classifier.coef_.toarray()
    topn = sorted(zip(svm_coef[labelid], feature_names))[-n:]

    for coef, feat in topn:
        print(feat, coef)

most_informative_feature_for_class(word_vectorizer, mnb, 'pt')
print()
most_informative_feature_for_class_svm(word_vectorizer, svcc)

With output:

pt teve -4.63472898823
pt tive -4.63472898823
pt todas -4.63472898823
pt vida -4.63472898823
pt de -4.22926388012
pt foi -4.22926388012
pt mais -4.22926388012
pt me -4.22926388012
pt as -3.94158180767
pt que -3.94158180767

no 0.0204081632653
parecer 0.0204081632653
pone 0.0204081632653
por 0.0204081632653
relación 0.0204081632653
una 0.0204081632653
visto 0.0204081632653
ya 0.0204081632653
es 0.0408163265306
lo 0.0408163265306

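【Discussion】:

Thanks for this amazing answer. What about applying the same procedure with a polynomial or RBF kernel?
As far as I understand it, I'm not sure that polynomial or RBF kernels generalize in a way that can be used for feature ranking. I think the question here might give you a better intuition about SVMs and what the weights mean. In general, the result of an SVM with anything other than a linear kernel is non-trivial to interpret, which is why the coef_ attribute is not available for polynomial or RBF kernels in sklearn.
Thanks for the help!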
You can use an explicit polynomial expansion with a linear classifier, and then do the feature analysis on that. scikit-learn.org/dev/modules/generated/…
By the way, you can work around the coef_ problem by using LinearSVC.