Return the indices of rows in a CountVectorizer that have non-zero entries for a particular feature in scikit-learn: answers

【Question title】: Return indices of rows in a CountVectorizer that have non-zero entries for a particular feature in scikit-learn
【Posted】: 2014-04-19 08:00:42
【Question】:

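I have been searching the documentation of Python's sklearn package.

I created a CountVectorizer object from my corpus, then fitted and transformed it.

I am looking for a function that, for a specific column, returns the indices of all rows with a non-zero entry.

So if the rows of my CountVectorizer matrix hold music reviews and the columns hold features (for example, one column counts the word "lyrics"), is there a function in scikit-learn that returns the indices of the music reviews containing that word?

I looked at the inverse_transform(X) function, but it does not do this.

I suspect I am not the first person interested in this functionality.

Does such a function exist in sklearn? If not, has anyone else with a similar need come up with a good way to implement it?

Thanks in advance.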

Update:

My best solution so far is to iterate over the columns (in my case, I have 100 features):

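X = X.tocsc()  # assumes CSC layout so each indptr slice covers one column (fit_transform returns CSR)
for i in range(100):  # 100 features, still hard-coded
    print(X.indices[X.indptr[i]:X.indptr[i+1]])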

But this seems wasteful: it is iterative, the range has to be hard-coded, and it returns empty lists for sparse columns.
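For illustration, here is a minimal sketch of the same per-column lookup without a hard-coded range. It assumes the matrix comes straight from `fit_transform` (SciPy CSR) and converts it to CSC, so that each `indptr` slice covers one column. The small corpus and the names `vectorizer`, `X_csc`, and `rows_per_column` are made up for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus standing in for the music reviews.
corpus = [
    "great lyrics and a great melody",
    "the melody is dull",
    "lyrics carry this album",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)  # CSR matrix: rows = reviews, columns = terms

# In CSC layout, indices[indptr[j]:indptr[j + 1]] holds the row indices of column j.
X_csc = X.tocsc()

rows_per_column = {}
for term, col in vectorizer.vocabulary_.items():  # term -> column index
    start, end = X_csc.indptr[col], X_csc.indptr[col + 1]
    rows_per_column[term] = sorted(X_csc.indices[start:end].tolist())

print(rows_per_column["lyrics"])  # [0, 2]
```

Looping over `vocabulary_` takes the number of columns from the fitted vectorizer, so nothing has to be hard-coded, and each list of row indices is keyed by the term it belongs to.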

【Comments】:

Tags: python scikit-learn word-frequency

【Solution 1】:

I don't see a function in the documentation that does this either, but this should work for you:

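def lookUpWord(vec, dtm, word):
    # vocabulary_ maps each term to its column index in the document-term matrix
    i = vec.vocabulary_[word]
    # row indices of the documents with a non-zero count in that column
    return dtm[:, i].nonzero()[0]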

Here is a simple example:

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> 
>>> corpus = [
...     'This is the first document.',
...     'This is the second second document.',
...     'And the third one.',
...     'Is this the first document?'
...     ]
>>> 
>>> X = CountVectorizer()
>>> Y = X.fit_transform(corpus)
>>> lookUpWord(X,Y,'first')
array([0, 3], dtype=int32)
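A word that never occurred in the fitted corpus has no column, so a lookup for it has to be handled separately. The sketch below is one possible variant, not part of the original answer: it reads the column index from the fitted vectorizer's `vocabulary_` mapping and returns an empty array for unseen words. The helper name `rows_containing` is hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def rows_containing(vectorizer, dtm, word):
    """Return the row indices of dtm whose count for `word` is non-zero."""
    col = vectorizer.vocabulary_.get(word)  # None if the word was never seen
    if col is None:
        return np.array([], dtype=int)
    # Slice out one column, then take the row part of its non-zero coordinates.
    return dtm[:, col].nonzero()[0]

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(corpus)

print(rows_containing(vectorizer, dtm, 'second'))  # [1]
print(rows_containing(vectorizer, dtm, 'lyrics'))  # []
```

CountVectorizer lowercases text by default, so the word passed in should be lowercase as well.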

【Comments】: