Use Python to convert files of word counts to a sparse matrix

[Question title]: Use Python to convert files of word counts to a sparse matrix
[Posted]: 2015-06-20 10:07:45
[Question description]:

I have a series of files, each containing word counts. Each file can contain a different set of words. Here is an example:

File A:

word1,20
word2,10
word3,2

File B:

word1,10
word4,50
word3,5

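There are about 20k files, and each file can contain up to tens of thousands of words.

Ultimately I want to build a sparse matrix in which each row represents the word distribution of one file, just like what you get from scikit-learn's CountVectorizer.

If word1, word2, word3, word4 are the columns, and FileA and FileB are the rows, then I would expect to get: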

[[20,10,2,0],[10,0,5,50]]

How can I do this? If possible, I would also like to be able to include only words that appear in at least N files.

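[Question discussion]:

stackoverflow.com/questions/1938894/… has a highly upvoted answer. I think the N-files requirement is the tricky part. Generate two matrices, one with word counts and one with file counts, then use the latter as a mask for the former? You could then adjust N relatively easily, which seems useful.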
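The masking idea in this comment can be sketched roughly as follows. This is only an illustration, assuming the word counts are already in a list of rows (one row per file); it uses a per-column file-count vector rather than a full second matrix, and the helper names files_per_column and keep_frequent_columns are made up for this example.

```python
def files_per_column(matrix):
    """For each column (word), count the rows (files) with a non-zero entry."""
    return [sum(1 for value in column if value) for column in zip(*matrix)]


def keep_frequent_columns(matrix, min_files):
    """Drop columns whose word occurs in fewer than min_files files.

    Returns the filtered rows and the indices of the columns that were kept.
    """
    file_counts = files_per_column(matrix)
    keep = [j for j, n in enumerate(file_counts) if n >= min_files]
    return [[row[j] for j in keep] for row in matrix], keep


# Rows are FileA and FileB; columns are word1..word4 from the question.
word_counts = [[20, 10, 2, 0], [10, 0, 5, 50]]
filtered, kept = keep_frequent_columns(word_counts, min_files=2)
```

Because the file counts are computed separately from the word counts, N can be changed by calling keep_frequent_columns again with a different min_files, without re-reading the files.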

Tags: python nlp sparse-matrix

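[Solution 1]:

You can use two dictionaries: one mapping each word to the number of files it appears in, and one mapping each filename to the word counts in that file.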

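import collections

files = ["file1", "file2"]
# word -> number of files containing it (assumes one line per word per file)
all_words = collections.defaultdict(int)
# filename -> {word: count}
all_files = collections.defaultdict(dict)

for filename in files:
    with open(filename) as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            # split on the last comma so the count is always the final field
            word, count = line.strip().rsplit(",", 1)
            all_files[filename][word] = int(count)
            all_words[word] += 1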

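You can then use those in a nested list comprehension to build the matrix (as a plain list of lists, one row per file, with columns in sorted word order):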

>>> [[all_files[f].get(w, 0) for w in sorted(all_words)] for f in files]
[[20, 10, 2, 0], [10, 0, 5, 50]]

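Or keep only the words that appear in more than one file (use >= N instead of > 1 for an arbitrary threshold):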

>>> [[all_files[f].get(w, 0) for w in sorted(all_words) if all_words[w] > 1] for f in files]
[[20, 2], [10, 5]]

[Discussion]: