What is the difference between 'term frequency' and 'document frequency'? Answers

【Question title】: What is the difference between 'term frequency' and 'document frequency'?
【Posted】: 2016-04-23 07:58:11
【Question description】:

Edit: This is the question I ultimately meant to ask: Understanding min_df and max_df in scikit CountVectorizer

I was reading the documentation for scikit-learn's CountVectorizer and noticed that, when it discusses max_df, what matters is the document frequency of the tokens:

max_df : float in range [0.0, 1.0] or int, default=1.0
When building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.

But when it comes to max_features, we are interested in term frequency:

max_features : int or None, default=None
If not None, build a vocabulary that only consider the top max_features ordered by term frequency across the corpus.

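I'm confused: if we use max_df and set it to, say, 10, aren't we saying "ignore any token that appears more than 10 times"?

And if we set max_features to 100, aren't we saying "use only the 100 tokens that appear most often across the whole corpus"?

If I've got that right... then what is the difference in wording between "term frequency" and "document frequency"?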

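【Comments on the question】:

They mean pretty much what it says on the tin: document frequency is the frequency of documents (the fraction of all documents that contain the term), and term frequency is the frequency of the term itself.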
en.wikipedia.org/wiki/Tf%E2%80%93idf
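I don't understand what you mean by "difference in wording".
@pvg So if a term has a "document frequency" of 0.5, that means it appears in half of the texts in the corpus? If we use max_df = 0.5, surely that would interfere with the idf values.
@MonicaHeddneck: If you apply max_df indiscriminately, then yes, which is why max_df is meant for "corpus-specific stop words".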

Tags: python scikit-learn tf-idf

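【Solution 1】:

When you set max_df to 10, you are saying "ignore any token that appears in more than 10 documents". Here you do not consider how many times the token appears within each document, only the number of documents it appears in.

When you set max_features to 100, it means "sort the tokens by term frequency across the corpus in descending order (that is, by the total number of times each token appears across all documents in the corpus), then keep only the top 100 tokens".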
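To make the distinction concrete, here is a minimal sketch in plain Python that counts both quantities by hand. The toy corpus, the whitespace tokenizer, and the variable names are made up for illustration; this is not how scikit-learn implements it internally, only the same counting idea.

```python
from collections import Counter

# Toy corpus, invented for illustration: each string is one document.
docs = [
    "apple apple apple banana",
    "apple cherry",
    "banana cherry cherry",
    "apple",
]

term_freq = Counter()  # total occurrences of each token across the corpus
doc_freq = Counter()   # number of documents that contain each token

for doc in docs:
    tokens = doc.split()          # naive whitespace tokenizer (an assumption)
    term_freq.update(tokens)      # every occurrence counts
    doc_freq.update(set(tokens))  # each document counts at most once per token

# Analogue of max_df=2: drop tokens found in more than 2 documents.
kept_by_max_df = sorted(t for t, df in doc_freq.items() if df <= 2)

# Analogue of max_features=2: keep the 2 tokens with the highest term frequency.
kept_by_max_features = sorted(t for t, _ in term_freq.most_common(2))

print(dict(term_freq))       # {'apple': 5, 'banana': 2, 'cherry': 3}
print(dict(doc_freq))        # per-document counts: apple 3, banana 2, cherry 2
print(kept_by_max_df)        # ['banana', 'cherry']
print(kept_by_max_features)  # ['apple', 'cherry']
```

Note that "apple" is dropped by the document-frequency cut-off (it occurs in 3 of the 4 documents) yet ranks first by term frequency, so the two parameters really do select on different counts.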

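【Comments】:

That's not right: max_df ranges from 0.0 to 1.0.
@tripleee: max_df can take a float (a proportion of documents) or an integer (a raw document count).
According to the description in the question it can also be an int... the description states that if it is an int, absolute counts are used... I addressed the case of 10 because that is the example the OP gave.
But if it is a float in the specified range, what is the interpretation? Sounds like it should be called max_idf?
@tripleee: Please read the documentation.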
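
For anyone following the int-versus-float exchange above, a small sketch with CountVectorizer itself may help. It assumes scikit-learn is installed; the four-document corpus is invented for illustration. With 4 documents, max_df=2 (an absolute document count) and max_df=0.5 (a proportion of documents) should give the same cut-off.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for illustration: "apple" occurs in 3 of 4 documents,
# "banana" and "cherry" in 2 each.
docs = [
    "apple apple apple banana",
    "apple cherry",
    "banana cherry cherry",
    "apple",
]

# No filtering: every token ends up in the vocabulary.
unfiltered = CountVectorizer().fit(docs)

# int: ignore terms that occur in more than 2 documents.
by_count = CountVectorizer(max_df=2).fit(docs)

# float: ignore terms that occur in more than 50% of the documents.
by_fraction = CountVectorizer(max_df=0.5).fit(docs)

print(sorted(unfiltered.vocabulary_))
print(sorted(by_count.vocabulary_))
print(sorted(by_fraction.vocabulary_))
```

In both filtered cases "apple" should be the only term removed, because it is the only one whose document frequency exceeds the threshold. So a float max_df is still a document-frequency threshold, just expressed as a fraction of the corpus rather than as a raw count.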