How to find word relevance in a single document? Answers

【Question title】: How to find words relevances in a single document?
【Posted】: 2019-08-22 17:32:39
【Question description】:

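I want to find the relevance of certain words (e.g. economy, technology) in a single document.

The document is about 30 pages long, and the idea is to extract all the text and determine how relevant each word is to that document.

I know that TF-IDF is meant for a collection of documents, but is it possible to use TF-IDF to solve this problem? If not, how can I do this in Python?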

【Question discussion】:

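You could build an IDF vector from a larger collection of documents. You need *something* to compare against in order to establish a baseline.
The IDF part of TF-IDF makes this approach counterintuitive, because it assumes that a word is important when it is frequent in a single document but rare across the whole collection. It may be better to consider only term frequency and remove stop words.
Maybe a summarization algorithm would work?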
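
The second comment's suggestion (plain term frequency with stop words removed) can be sketched with the standard library alone. The stop-word set below is a tiny illustrative placeholder rather than a real list, the regex tokenizer is a simplification, and `top_terms` is a hypothetical helper name; in practice a fuller list such as `nltk.corpus.stopwords` would probably be a better choice.

```python
import re
from collections import Counter

# Illustrative placeholder only (assumption): a real stop-word list is much longer.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "that", "on", "for"}

def top_terms(text, k=10, stop_words=STOP_WORDS):
    """Return the k most frequent non-stop-word tokens with their relative frequencies."""
    # Lowercase and keep runs of ASCII letters only (a crude tokenizer, assumed for brevity).
    tokens = [t for t in re.findall(r"[a-z]+", text.lower()) if t not in stop_words]
    counts = Counter(tokens)
    total = sum(counts.values()) or 1  # avoid division by zero on empty input
    return [(word, count / total) for word, count in counts.most_common(k)]

if __name__ == "__main__":
    sample = "The economy and the technology sector: technology drives the economy of technology."
    for word, share in top_terms(sample, k=3):
        print(f"{share:6.2%} {word}")
```

This needs no reference corpus, but it cannot tell a topical word from one that is merely common in the language and missing from the stop-word list.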

Tags: python nltk word tf-idf tfidfvectorizer

【Solution 1】:

Using NLTK and one of its built-in corpora, you can make a rough estimate of how "relevant" each word is:

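from collections import Counter
from math import log

# Needs NLTK data: the Brown corpus plus the tokenizer models
# ('punkt', or 'punkt_tab' on newer NLTK releases), fetched via nltk.download(...)
from nltk import word_tokenize
from nltk.corpus import brown

# Term frequencies of the lowercased tokens in the document
with open('document.txt') as f:
    tf = Counter(word_tokenize(f.read().lower()))

# Background word frequencies from the Brown corpus
freqs = Counter(w.lower() for w in brown.words())
n = sum(freqs.values())  # total number of tokens in the corpus

# Weight each count by a squared log inverse corpus frequency (+1 avoids division by zero)
scores = Counter({w: c * log(n / (freqs[w] + 1)) ** 2 for w, c in tf.items()})

for word, score in scores.most_common(10):
    print('%8.2f %s' % (score, word))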

Change document.txt to the name of your document, and the script will print the ten most "relevant" words in it.
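
To make the weighting in the script easier to see, here is a minimal sketch of the same score (the in-document count times the squared logarithm of n / (f + 1)) applied to hand-made numbers instead of the Brown corpus. The background counts and the corpus size are invented purely for illustration, and `relevance_scores` is a hypothetical helper name.

```python
from collections import Counter
from math import log

def relevance_scores(doc_tokens, background_counts, background_size):
    """Score each token as tf * log(n / (f + 1)) ** 2, where f is its background count."""
    tf = Counter(doc_tokens)
    return {
        word: count * log(background_size / (background_counts.get(word, 0) + 1)) ** 2
        for word, count in tf.items()
    }

if __name__ == "__main__":
    # Made-up background statistics (assumption, not real Brown corpus figures).
    background = {"the": 70_000, "economy": 120, "blockchain": 0}
    n = 1_000_000
    doc = ["the"] * 50 + ["economy"] * 5 + ["blockchain"] * 2
    scores = relevance_scores(doc, background, n)
    # Print words from highest to lowest score.
    for word, score in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(f"{score:8.2f} {word}")
```

Words that never occur in the background corpus receive the largest weight per occurrence, so typos, names, and punctuation tokens can rank high. Filtering the tokens first may be worthwhile.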

【Discussion】: