How can I get the highest-frequency terms out of TF-IDF vectors for each file in scikit-learn?

【Question title】: How can I get the highest-frequency terms out of TF-IDF vectors for each file in scikit-learn?
【Posted】: 2012-10-22 07:06:34
【Question description】:

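I am trying to get the highest-frequency terms out of vectors in scikit-learn. The example below shows how to do this for each category, but I want it for each file within the categories.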

https://github.com/scikit-learn/scikit-learn/blob/master/examples/document_classification_20newsgroups.py

    if opts.print_top10:
        print "top 10 keywords per class:"
        for i, category in enumerate(categories):
            top10 = np.argsort(clf.coef_[i])[-10:]
            print trim("%s: %s" % (
                category, " ".join(feature_names[top10])))

I want to do this for each file in the test dataset rather than for each category. Where should I look?

Thanks

Edit: s/discriminitive/highest frequency/g (sorry for the confusion)

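【Question comments】:

Can't you just transform your test data with the same vectorizer you used to parse the training data? After fit and transform have been called, the vectorizer stores the vocabulary and uses it to filter any data you pass in (according to the documentation).
The vocabulary does not store any information about which document (or array/list index) a term came from. It is just a vocabulary, as you will see if you look at the scikit-learn source code.

Tags: python parsing machine-learning classification scikit-learn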
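
To make the two comments above concrete, here is a minimal sketch of what a fitted vectorizer stores and how it treats new data. The toy documents are made up for illustration; only `CountVectorizer`, `fit`, `transform` and the `vocabulary_` attribute come from scikit-learn.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data, for illustration only
train_docs = ["apple banana apple", "banana cherry"]
test_docs = ["apple durian", "cherry cherry banana"]

vectorizer = CountVectorizer()
vectorizer.fit(train_docs)

# vocabulary_ maps each term to a column index; it records nothing about
# which document a term came from.
print(sorted(vectorizer.vocabulary_.items()))

# Transforming new documents reuses the fitted vocabulary, so "durian"
# (never seen during fit) is silently dropped.
X_test = vectorizer.transform(test_docs)
print(X_test.toarray())
```

Each row of `X_test` corresponds to one test document, which is why per-document information has to be read from the transformed matrix rather than from the vocabulary.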

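【Solution 1】:

You can use the result of transform together with get_feature_names to obtain the term counts for a given document (or the tf-idf weights, if the vectorizer is a TfidfVectorizer).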

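import numpy as np

# vectorizer must already be fitted; docs is a list of raw documents
X = vectorizer.transform(docs)
terms = np.array(vectorizer.get_feature_names())  # get_feature_names_out() in newer scikit-learn releases
terms_for_first_doc = zip(terms, X.toarray()[0])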

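【Comments】:

Tested and correct. I was about to post almost the same answer :)
By get_feature_names, do you mean vectorizer.get_feature_names()?
terms = np.array(vectorizer.get_feature_names()) first_top = zip(terms, X_test.toarray()[0]) still does not work.
It retrieves all the available terms!
@V3ss0n: those are not the discriminative terms, they are just the high-frequency terms. Use sorted, heapq.nlargest or whatever Python trick you like to get the terms you want out of terms_for_first_doc: stackoverflow.com/a/13070505/166749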
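
The last comment can be sketched roughly as follows. This is only one possible way to do it, not the commenter's own code: `top_terms_per_doc` and the toy corpus are hypothetical names introduced here, and the sketch reads each sparse row directly instead of calling `toarray()` on the whole matrix.

```python
import heapq

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer


def top_terms_per_doc(vectorizer, docs, n=3):
    """Return, for each document, its n highest-weighted (term, weight) pairs."""
    X = vectorizer.transform(docs).tocsr()
    # Newer scikit-learn releases expose get_feature_names_out();
    # older ones only have get_feature_names().
    get_names = getattr(vectorizer, "get_feature_names_out", None)
    if get_names is None:
        get_names = vectorizer.get_feature_names
    terms = np.asarray(get_names())

    result = []
    for i in range(X.shape[0]):
        # CSR layout: the non-zero entries of row i live in this slice
        start, end = X.indptr[i], X.indptr[i + 1]
        pairs = zip(terms[X.indices[start:end]], X.data[start:end])
        result.append(heapq.nlargest(n, pairs, key=lambda pair: pair[1]))
    return result


# Hypothetical toy corpus, for illustration only
train_docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stock prices fell sharply today",
]
test_docs = ["cat cat cat dog", "stock prices today"]

vectorizer = TfidfVectorizer()
vectorizer.fit(train_docs)

top_terms = top_terms_per_doc(vectorizer, test_docs, n=2)
for doc, top in zip(test_docs, top_terms):
    print(doc, "->", [(str(term), round(float(weight), 3)) for term, weight in top])
```

Only terms that actually occur in a document are considered, so the "it retrieves all the available terms" problem goes away, and `n` controls how many terms are kept per file.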

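【Solution 2】:

It seems nobody knows. I am answering here because other people face the same problem. I have found where to look, but have not fully implemented it yet.

It is deep inside CountVectorizer in sklearn.feature_extraction.text: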

def transform(self, raw_documents):
    """Extract token counts out of raw text documents using the vocabulary
    fitted with fit or the one provided in the constructor.

    Parameters
    ----------
    raw_documents: iterable
        an iterable which yields either str, unicode or file objects

    Returns
    -------
    vectors: sparse matrix, [n_samples, n_features]
    """
    if not hasattr(self, 'vocabulary_') or len(self.vocabulary_) == 0:
        raise ValueError("Vocabulary wasn't fitted or is empty!")

    # raw_documents can be an iterable so we don't know its size in
    # advance

    # XXX @larsmans tried to parallelize the following loop with joblib.
    # The result was some 20% slower than the serial version.
    analyze = self.build_analyzer()
    term_counts_per_doc = [Counter(analyze(doc)) for doc in raw_documents]
    # <<-- added here: keep a copy of the per-document counts (needs `from copy import deepcopy`)
    self.test_term_counts_per_doc = deepcopy(term_counts_per_doc)
    return self._term_count_dicts_to_matrix(term_counts_per_doc)

I added self.test_term_counts_per_doc=deepcopy(term_counts_per_doc), so the counts can be read from outside the vectorizer like this:

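import os
from time import time
from sklearn.feature_extraction.text import TfidfVectorizer

# recursive_load_files, trainer_path and tester_path are not defined in this snippet
load_files = recursive_load_files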
trainer_path = os.path.realpath(trainer_path)
tester_path = os.path.realpath(tester_path)
data_train = load_files(trainer_path, load_content = True, shuffle = False)


data_test = load_files(tester_path, load_content = True, shuffle = False)
print 'data loaded'

categories = None    # for case categories == None

print "%d documents (training set)" % len(data_train.data)
print "%d documents (testing set)" % len(data_test.data)
#print "%d categories" % len(categories)
print

# split a training set and a test set

print "Extracting features from the training dataset using a sparse vectorizer"
t0 = time()
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.7,
                             stop_words='english',charset_error="ignore")

X_train = vectorizer.fit_transform(data_train.data)


print "done in %fs" % (time() - t0)
print "n_samples: %d, n_features: %d" % X_train.shape
print

print "Extracting features from the test dataset using the same vectorizer"
t0 = time()
X_test = vectorizer.transform(data_test.data)
print "Test printing terms per document"
for counter in vectorizer.test_term_counts_per_doc:
    print counter

Here is my fork; I have also submitted a pull request:

https://github.com/v3ss0n/scikit-learn

If there is a better way, please suggest it.
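
One possible alternative that avoids patching scikit-learn is to call the vectorizer's own analyzer from outside, since the `transform` source quoted above obtains it through `build_analyzer()`. The sketch below is an assumption about how that could look, not part of the fork; the toy documents and the name `test_term_counts_per_doc` (reused from the patch) are for illustration only.

```python
from collections import Counter

from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data, for illustration only
train_docs = ["the quick brown fox", "the lazy dog sleeps"]
test_docs = ["quick quick fox unknown", "lazy dog dog dog"]

vectorizer = TfidfVectorizer()
vectorizer.fit(train_docs)

# build_analyzer() returns the same preprocessing + tokenizing callable
# that the vectorizer applies internally.
analyze = vectorizer.build_analyzer()
vocabulary = vectorizer.vocabulary_

test_term_counts_per_doc = [
    # Keep only terms present in the fitted vocabulary. The patch above
    # stores the unfiltered counts; drop the condition to match it.
    Counter(term for term in analyze(doc) if term in vocabulary)
    for doc in test_docs
]

for doc, counts in zip(test_docs, test_term_counts_per_doc):
    print(doc, "->", counts.most_common(2))
```

This tokenizes each test document a second time, which costs some extra work but leaves the library untouched.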

【Comments】:

Why the -1? It is a working solution (though it requires modifying scikit-learn).
Enjoy your downvotes on SO, trolls.