TF-IDF for data filtering

【Question title】: TF-IDF for data filtering
【Posted】: 2018-06-19 12:21:06
【Question】:

I have a list of raw documents that has already been filtered, with English stop words removed:

rawDocument = ['sport british english sports american english includes forms competitive physical activity games casual organised ...', 'disaster serious disruption occurring relatively short time functioning community society involving ...', 'government system group people governing organized community often state case broad associative definition ...', 'technology science craft greek τέχνη techne art skill cunning hand λογία logia collection techniques ...']

I used:

from sklearn.feature_extraction.text import TfidfVectorizer
sklearn_tfidf = TfidfVectorizer(norm='l2', min_df=0, use_idf=True, smooth_idf=False, sublinear_tf=False)
sklearn_representation = sklearn_tfidf.fit_transform(rawDocument)

but what I get back is:

<4x50 sparse matrix of type '<class 'numpy.float64'>'
    with 51 stored elements in Compressed Sparse Row format>

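I can't interpret this result. Am I using the right tool, or do I need to change my approach?

My goal is to get the relevant words in each document, so that I can compute cosine similarity against the words of a query document.

Thanks in advance.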

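【Comments】:

Call sklearn_representation.todense() and you will get a matrix. The output is a sparse matrix for memory reasons, so you have to convert it to a dense one. This is not recommended for large datasets.
Is sklearn_representation.todense() the same as sklearn_representation.toarray()? I can't explain why my matrix contains these values, or which word the value at [0][3] relates to.
You won't get that directly. Call sklearn_tfidf.get_feature_names() to get the words; these are your column headers, and each row is simply one document. You have to combine this information yourself.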

Tags: python scikit-learn tf-idf tfidfvectorizer

【Solution 1】:

The Pandas module is often handy for getting a better view of your data:

Demo:

import pandas as pd

df = pd.SparseDataFrame(sklearn_tfidf.fit_transform(rawDocument),
                        columns=sklearn_tfidf.get_feature_names(),
                        default_fill_value=0)

Result:

In [85]: df
Out[85]:
   activity  american       art  associative  british    ...       system    techne  techniques  technology      time
0      0.25      0.25  0.000000     0.000000     0.25    ...     0.000000  0.000000    0.000000    0.000000  0.000000
1      0.00      0.00  0.000000     0.000000     0.00    ...     0.000000  0.000000    0.000000    0.000000  0.308556
2      0.00      0.00  0.000000     0.282804     0.00    ...     0.282804  0.000000    0.000000    0.000000  0.000000
3      0.00      0.00  0.288675     0.000000     0.00    ...     0.000000  0.288675    0.288675    0.288675  0.000000

[4 rows x 48 columns]

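【Discussion】:

With this data type, can I use a single token from the raw documents as a column index? For example, could I get the tf-idf of the "associative" token in the third row this way: df[2]['ability']?
@FedericoCuozzo, you can do this in a few ways: df.loc[2, 'associative'] or df.at[2, 'associative']. You may want to read this...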