scikit learn: update countvectorizer after selecting k best features (answers)

【Question title】: scikit learn: update countvectorizer after selecting k best features
【Posted】: 2014-09-16 08:10:42
【Question description】:

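I have a count vectorizer with a large number of features, and I want to select the k best features from the transformed set and then update count_vectorizer so that it contains only those features. Is this possible?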

import pandas as pd
import numpy as np
import scipy as sp
import scipy.stats as ss
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier

# raw string so the backslash escapes reach the regex engine unchanged
merge=re.compile(r'\*\|.+?\|\*')
def stripmerge(sub):
    for i in merge.findall(sub):
        j=i
        j=j.replace('*|','mcopen')
        j=j.replace('|*','mcclose')
        j=re.sub('[^0-9a-zA-Z]','',j)
        sub=sub.replace(i,j)
    return sub

input=pd.read_csv('subject_tool_test_23.csv')
input.subject[input.subject.isnull()]=' '


subjects=np.asarray([stripmerge(i) for i in input.subject])
count_vectorizer = CountVectorizer(strip_accents='unicode', ngram_range=(1,1), binary=True, stop_words='english', max_features=500)
counts=count_vectorizer.fit_transform(subjects)

#see the first output example here

from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

good=np.asarray(input.unique_open_performance>0)

count_new = SelectKBest(chi2, k=250).fit_transform(counts, good)

First output example; the features make sense:

>>> counts[1]
<1x500 sparse matrix of type '<type 'numpy.int64'>'
    with 3 stored elements in Compressed Sparse Row format>
>>> subjects[1]
"Lake Group Media's Thursday Target"
>>> count_vectorizer.inverse_transform(counts[1])
[array([u'group', u'media', u'thursday'], 
      dtype='<U18')]

Second output example; the features no longer match:

>>> count_new = SelectKBest(chi2, k=250).fit_transform(counts, good)
>>> count_new.shape
(992979, 250)
>>> count_new[1]
<1x250 sparse matrix of type '<type 'numpy.int64'>'
    with 2 stored elements in Compressed Sparse Row format>
>>> count_vectorizer.inverse_transform(count_new[1])
[array([u'independence', u'easy'], 
      dtype='<U18')]
>>> subjects[1]
"Lake Group Media's Thursday Target"

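Is there a way to apply the feature selection results to my count vectorizer so that I can generate new vectors containing only the important features?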

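【Question comments】:

One potential solution I am trying is to use the information in the post linked below to generate a word list, and then use that list as the vocabulary for a new count vectorizer. It is neither elegant nor efficient, but I think it will get the job done. stackoverflow.com/questions/14133348/…
What is the motivation for doing this?
Because the feature list created by the count vectorizer is too large. I want to select the features that look most relevant and use only those.
Why not use a pipeline of the count vectorizer and the feature selection?
I am relatively new to scikit and did not know about that option. I will give it a try. Thanks.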

Tags: python machine-learning scikit-learn text-processing

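【Solution 1】:

The way I solved this was to run the feature selection, determine which columns of the original set were selected, build a dictionary (vocabulary) from them, and then run a new count vectorizer restricted to that dictionary. It takes longer on large data sets, but it does work.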

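ch2 = SelectKBest(chi2, k=3000)

count_new = ch2.fit_transform(counts, good)
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
# the result is named vocab to avoid shadowing the built-in dict
vocab = np.asarray(count_vectorizer.get_feature_names_out())[ch2.get_support()]
count_vectorizer = CountVectorizer(strip_accents='unicode', ngram_range=(1,1), binary=True, vocabulary=vocab)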

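【Comments】:

【Solution 2】:

I believe this is what you are looking for. It is a modified SelectKBest object that can transform a vocabulary object (a term: index dict) or a CountVectorizer object and update its vocabulary. There is no need to re-extract all the features.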

from sklearn.feature_selection import SelectKBest, f_classif
import numpy as np

class CustomSelectKBest(SelectKBest):
  """
    Extending SelectKBest with the ability to update a vocabulary that is given
    from a CountVectorizer object.
  """
  def __init__(self, score_func=f_classif, k=10):
    # pass keyword arguments: k is keyword-only in recent scikit-learn releases
    super(CustomSelectKBest, self).__init__(score_func=score_func, k=k)

  def transform_vocabulary(self, vocabulary):
    mask  = self.get_support(True)
    i_map = { j:i for i, j in enumerate(mask) }
    # dict.items() replaces the Python 2 only dict.iteritems()
    return { k:i_map[i] for k, i in vocabulary.items() if i in i_map }

  def transform_vectorizer(self, cv):
    cv.vocabulary_ = self.transform_vocabulary(cv.vocabulary_)

if __name__ == '__main__':
  def score_func(X, y):
    # Fake scores and p-values
    return (np.arange(X.shape[1]), np.zeros(X.shape[1]))

  # Create test data.
  size = (4, 10)
  X = (np.random.randint(0,5, size=size))
  y = np.random.randint(2, size=size[0])
  vocabulary = {chr(i+ord('a')):i for i in range(size[1])}

  skb = CustomSelectKBest(score_func=score_func, k=5)
  X_s = skb.fit_transform(X, y)
  vocab_s = skb.transform_vocabulary(vocabulary)

  # Confirm they have the right values.
  for k, i_s in vocab_s.items():
    i = vocabulary[k]
    assert((X_s[:,i_s].T == X[:,i].T).all())

  print('Test passed')
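
The same index remapping can also be tried on a real CountVectorizer without subclassing. The sketch below is only an illustration under stated assumptions: `remap_vocabulary`, `docs`, and `labels` are hypothetical names and data, and overwriting `vocabulary_` on a fitted vectorizer relies on `transform` reading that attribute, which is not a documented way of editing a vocabulary.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2


def remap_vocabulary(vocabulary, kept_indices):
    """Turn {term: old column} into {term: new column}, dropping unselected terms."""
    new_index = {int(old): new for new, old in enumerate(kept_indices)}
    return {
        term: new_index[int(old)]
        for term, old in vocabulary.items()
        if int(old) in new_index
    }


# Hypothetical toy data, for illustration only.
docs = [
    "free offer today",
    "meeting agenda today",
    "free prize offer",
    "project meeting notes",
]
labels = np.array([1, 0, 1, 0])

cv = CountVectorizer(binary=True)
counts = cv.fit_transform(docs)

selector = SelectKBest(chi2, k=3).fit(counts, labels)
counts_selected = selector.transform(counts)

# Replace the fitted vocabulary with the remapped one (no refit needed).
cv.vocabulary_ = remap_vocabulary(
    cv.vocabulary_, selector.get_support(indices=True)
)
counts_again = cv.transform(docs)

same = np.array_equal(counts_again.toarray(), counts_selected.toarray())
print(cv.vocabulary_, same)
```

Casting the indices with `int()` also leaves the new vocabulary with built-in integers, which is convenient if it is written to disk later.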

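【Comments】:

I used this answer so that I could save the transformed vocabulary to a JSON file and reuse it in another run. The np.asarray result from the accepted answer is apparently "not JSON serializable".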
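
The serialization problem mentioned in this comment can be sketched as follows. The vocabulary here is hypothetical and its indices are deliberately NumPy integers, which `json.dumps` rejects. Casting to built-in `str` and `int` first is one possible workaround, not necessarily what the commenter did.

```python
import json

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical vocabulary whose indices are NumPy integers.
vocabulary = {"free": np.int64(0), "meeting": np.int64(1), "offer": np.int64(2)}

# The json module does not know how to encode np.int64 values.
try:
    json.dumps(vocabulary)
    raw_ok = True
except TypeError:
    raw_ok = False

# Cast keys and values to built-in types before dumping.
serializable = {str(term): int(index) for term, index in vocabulary.items()}
payload = json.dumps(serializable, sort_keys=True)

# In a later run, load the mapping and hand it to a fresh vectorizer.
restored = json.loads(payload)
cv = CountVectorizer(binary=True, vocabulary=restored)
row = cv.transform(["free offer for the meeting"]).toarray()
print(payload, row.tolist())
```

An array of terms can be handled the same way by calling `.tolist()` on it before dumping.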

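【Solution 3】:

Use a Pipeline to make your life easier. The Pipeline automatically applies the transformations to the test data, so you do not have to recreate the vectorizer by hand.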

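from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# X_train, y_train, X_test and y_test are assumed to exist already (raw texts and labels)
text_clf_red = Pipeline([('vect', CountVectorizer()),
                       ('reducer', SelectKBest(chi2, k=3000)),
                       ('clf', MultinomialNB())
                       ])

text_clf_red.fit(X_train, y_train)
y_test_pred = text_clf_red.predict(X_test)
metrics.accuracy_score(y_test, y_test_pred)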

【Comments】: