Filter only certain words from sklearn CountVectorizer sparse matrix

【Question title】: Filter only certain words from sklearn CountVectorizer sparse matrix
【Posted】: 2016-07-27 01:33:30
【Question description】:

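I have a pandas Series containing a lot of text. Using CountVectorizer from the sklearn package, I computed a sparse matrix, and I have also identified the most frequent words. Now I want to filter my sparse matrix so that it keeps only those top words.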

The original data has more than 7000 rows and contains more than 75000 words, so I am creating sample data here:

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
words = pd.Series(['This is first row of the text column',
                   'This is second row of the text column',
                   'This is third row of the text column',
                   'This is fourth row of the text column',
                   'This is fifth row of the text column'])
count_vec = CountVectorizer(stop_words='english')
sparse_matrix = count_vec.fit_transform(words)

I have created the sparse matrix for all the words in that column. Here, only in order to print the sparse matrix, I convert it to an array with .toarray().

print count_vec.get_feature_names()
print sparse_matrix.toarray()
[u'column', u'fifth', u'fourth', u'row', u'second', u'text']
[[1 0 0 1 0 1]
 [1 0 0 1 1 1]
 [1 0 0 1 0 1]
 [1 0 1 1 0 1]
 [1 1 0 1 0 1]]

Now I find the frequently occurring words with the following:

# Get frequency count of all features
features_count = sparse_matrix.sum(axis=0).tolist()[0]
features_names = count_vec.get_feature_names()
features = pd.DataFrame(zip(features_names, features_count), 
                                columns=['features', 'count']
                               ).sort_values(by=['count'], ascending=False)

  features  count
0   column      5
3      row      5
5     text      5
1    fifth      1
2   fourth      1
4   second      1

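From the result above we know that the most frequent words are column, row and text. Now I want to filter my sparse matrix for only these words. I do not want to convert the sparse matrix to an array and then filter it, because that raises a memory error on my original data, where the number of words is very large.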

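The only way I have found to get such a sparse matrix is to repeat the steps with those specific words, passing them through the vocabulary argument, like this: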

countvec_subset = CountVectorizer(vocabulary= ['column', 'text', 'row'])

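Instead, I am looking for a better solution where I can filter the existing sparse matrix for these words directly, rather than recreating it from scratch.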

【Question discussion】:

Tags: python pandas scikit-learn sparse-matrix

【Solution 1】:

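You can slice the sparse matrix. You need to derive the column indices to slice on and then use sparse_matrix[:, columns]. The session below assumes numpy has been imported as np.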

In [56]: feature_count = sparse_matrix.sum(axis=0)

In [57]: columns = tuple(np.where(feature_count == feature_count.max())[1])

In [58]: columns
Out[58]: (0, 3, 5)

In [59]: sparse_matrix[:, columns].toarray()
Out[59]:
array([[1, 1, 1],
       [1, 1, 1],
       [1, 1, 1],
       [1, 1, 1],
       [1, 1, 1]], dtype=int64)

In [60]: type(sparse_matrix[:, columns])
Out[60]: scipy.sparse.csr.csr_matrix

In [71]: np.array(features_names)[list(columns)]
Out[71]:
array([u'column', u'row', u'text'],
      dtype='<U6')

The sliced subset is still a scipy.sparse.csr.csr_matrix, so nothing is converted to a dense array along the way.
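
If the words to keep are already known by name, as in the question, the column indices can also be looked up rather than derived from the maximum count. The following is a minimal sketch, not part of the original answer: it assumes the fitted vectorizer's vocabulary_ mapping (term to column index) and uses a hypothetical list called wanted_words.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    'This is first row of the text column',
    'This is second row of the text column',
    'This is third row of the text column',
    'This is fourth row of the text column',
    'This is fifth row of the text column',
]

count_vec = CountVectorizer(stop_words='english')
sparse_matrix = count_vec.fit_transform(docs)

# Hypothetical list of words to keep, chosen here for illustration.
wanted_words = ['column', 'row', 'text']

# vocabulary_ maps each term to its column index in the fitted matrix.
columns = [count_vec.vocabulary_[word] for word in wanted_words]

# Slicing by a list of column indices returns another sparse matrix.
subset = sparse_matrix[:, columns]

print(columns)
print(subset.shape)
```

The columns of subset follow the order of wanted_words, so the same list can serve as the feature names for the filtered matrix. A word that is missing from the fitted vocabulary would raise a KeyError in this sketch.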

【Discussion】:

Any idea how to select the top n words by count?
np.array(feature_count)[0].argsort()[-4:][::-1] should give you the top 4. There may be a better way.
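
The top-n idea from the comment above can be sketched end to end as follows. This is an illustrative variation rather than the commenter's exact code: it sorts the negated totals with a stable sort so that ties resolve to the lower column index, and it uses a small hand-built count matrix (the variable names counts, feature_names and n are assumptions) in place of the real CountVectorizer output.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy count matrix standing in for the CountVectorizer output
# (rows are documents, columns are features).
counts = csr_matrix(np.array([
    [1, 0, 0, 1, 0, 1],
    [1, 0, 0, 1, 1, 1],
    [1, 0, 0, 1, 0, 1],
    [1, 0, 1, 1, 0, 1],
    [1, 1, 0, 1, 0, 1],
]))
feature_names = np.array(['column', 'fifth', 'fourth', 'row', 'second', 'text'])

n = 4  # number of top features to keep

# .sum(axis=0) returns a 1 x n_features matrix, so flatten it to 1-D.
totals = np.asarray(counts.sum(axis=0)).ravel()

# A stable sort on the negated totals orders columns by descending count
# and breaks ties in favour of the lower column index.
top_columns = np.argsort(-totals, kind='stable')[:n]

# Slice the sparse matrix without converting it to a dense array.
top_matrix = counts[:, top_columns]

print(feature_names[top_columns])
print(top_matrix.shape)
```

Only the 1-D vector of column totals is dense here, which is one value per feature, so the memory cost stays small even for a vocabulary of tens of thousands of words.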