Get the top term per document - scikit tf-idf (answers)

【Question title】: Get the top term per document - scikit tf-idf
【Posted】: 2019-06-09 10:04:42
【Question description】:

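After vectorizing several documents with scikit's tf-idf vectorizer, is there a way to get the most "influential" term for each document?

I have only found ways to get the most "influential" terms for the whole corpus, not for each document.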

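【Question comments】:

How do you define the most influential term in each document? Specifically, how does it differ from the word with the highest tf-idf in the document?
Either use tf-idf on each document itself rather than on the whole corpus, or filter the whole-corpus tf-idf results by the new document's vocabulary.
@AmiTavory I think that is what I actually want. I'm not sure how to get the word with the highest tf-idf for each document. Sorry, I'm still very new to this.

Tags: python scikit-learn tf-idf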

【Solution 1】:

Suppose you start with a dataset:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
import numpy as np
from sklearn.datasets import fetch_20newsgroups

d = fetch_20newsgroups()

Use CountVectorizer and TfidfTransformer:

count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(d.data)
transformer = TfidfTransformer()
X_train_tfidf = transformer.fit_transform(X_train_counts)
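The question mentions the tf-idf vectorizer, while the code above uses two steps. As far as the scikit-learn documentation states, TfidfVectorizer is equivalent to CountVectorizer followed by TfidfTransformer. The sketch below checks this on a tiny made-up corpus (the `docs` list is an assumption for illustration, not part of the original answer), with default parameters on both sides:

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical toy corpus, used only for illustration.
docs = [
    "apple apple banana",
    "banana cherry cherry",
    "apple durian durian durian",
]

# Two-step route, as in the answer: raw counts, then tf-idf weighting.
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# One-step route with default parameters.
one_step = TfidfVectorizer().fit_transform(docs)

# Compare the two weight matrices element by element.
same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)
```

Either route should therefore work for the steps that follow.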

Now you can create a reverse mapping from column index to term:

m = {v: k for (k, v) in count_vect.vocabulary_.items()}

This gives the most influential word for each doc:

[m[t] for t in np.array(np.argmax(X_train_tfidf, axis=1)).flatten()]

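【Discussion】:

Thanks for your help! Much appreciated!
No problem. All the best.
Hi, can you explain how the last line gives the influential word for each doc? I'm printing that line and it just gives the top 10 words from the whole table.

【Solution 2】:

Just adding one more way to do the last two steps of Ami's answer: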

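# Get an array of all terms, indexed by column
# (on scikit-learn < 1.0 use count_vect.get_feature_names() instead)
feature_names = np.array(count_vect.get_feature_names_out())
# argmax on a sparse matrix returns an (n_docs, 1) matrix, so flatten it before indexing
feature_names[np.asarray(X_train_tfidf.argmax(axis=1)).ravel()]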

【Discussion】:

Very nice.