How do you include all words from the corpus in a Gensim TF-IDF? (Answers)

【Question title】: How do you include all words from the corpus in a Gensim TF-IDF?
【Posted】: 2020-03-16 19:28:07
【Question】:

Suppose I have some documents like this:

doc1 = "hello hello this is a document"
doc2 = "this text is very interesting"
documents = [doc1, doc2]

I compute a TF-IDF matrix for them in Gensim as follows:

from gensim import corpora, models
from gensim.utils import simple_preprocess

# create dictionary
dictionary = corpora.Dictionary([simple_preprocess(line) for line in documents])
# create bow corpus
corpus = [dictionary.doc2bow(simple_preprocess(line)) for line in documents]
# fit the tf-idf model (apply it to a document with tfidf[bow])
tfidf = models.TfidfModel(corpus, smartirs='ntc')

Then for each document I get a TF-IDF vector like this:

Doc1: [("hello", 0.5), ("a", 0.25), ("document", 0.25)]
Doc2: [("text", 0.333), ("very", 0.333), ("interesting", 0.333)]

But I want each document's TF-IDF vector to also include the words whose TF-IDF value is 0 (i.e., to include every word that appears in the corpus):

Doc1: [("hello", 0.5), ("this", 0), ("is", 0), ("a", 0.25), ("document", 0.25), ("text", 0), ("very", 0), ("interesting", 0)]
Doc2: [("hello", 0), ("this", 0), ("is", 0), ("a", 0), ("document", 0), ("text", 0.333), ("very", 0.333), ("interesting", 0.333)]

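How can I do this in Gensim? Or is there perhaps another library that can compute a TF-IDF matrix this way? Like Gensim, it would need to handle very large datasets: I got this result with scikit-learn on a small dataset, but scikit-learn runs into memory problems on large ones.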

【Comments】:

Tags: python nlp gensim text-classification tf-idf

【Solution 1】:

You can do this with scikit-learn's TfidfVectorizer (sklearn.feature_extraction.text.TfidfVectorizer). It takes just four lines:

>>> import pandas as pd
>>> from sklearn.feature_extraction.text import TfidfVectorizer

>>> corpus = ["hello hello this is a document", "this text is very interesting"]
>>> vectorizer = TfidfVectorizer()
>>> X = vectorizer.fit_transform(corpus)
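>>> # get_feature_names_out() needs scikit-learn >= 1.0; older releases use get_feature_names()
>>> df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())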
>>> df
   document     hello  interesting       is      text     this      very
0  0.407824  0.815648     0.000000  0.29017  0.000000  0.29017  0.000000
1  0.000000  0.000000     0.499221  0.35520  0.499221  0.35520  0.499221

Edit

You can convert the tf-idf matrix back into a Gensim corpus with matutils.Sparse2Corpus, like this:

>>> from gensim import matutils
>>> tfidf_mat = matutils.Sparse2Corpus(X, documents_columns=False)

Hope this helps.

【Discussion】:

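True, but my corpus has 70,000 documents, and scikit-learn tries to keep the tf-idf matrix in memory, which is not possible for such a large corpus because the matrix does not fit in memory. That is why I use Gensim, since Gensim stores the tf-idf matrix on disk.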