Why does the sklearn tf-idf vectorizer give the highest scores to stopwords? Answers

【Question title】: Why does sklearn tf-idf vectorizer give the highest scores to stopwords?
【Posted】: 2022-01-02 14:57:03
【Question】:

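I used sklearn to compute tf-idf for each category of the Brown corpus from the nltk library. There are 15 categories, and in every one of them the highest score is assigned to a stopword.

The default parameter is use_idf=True, so idf is being applied. The corpus seems large enough to produce sensible scores. So I don't understand: why are stopwords given high values?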

import nltk, sklearn, numpy
import pandas as pd
from nltk.corpus import brown, stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download('brown')
nltk.download('stopwords')

corpus = []
for c in brown.categories():
  doc = ' '.join(brown.words(categories=c))
  corpus.append(doc)

thisvectorizer = TfidfVectorizer()
X = thisvectorizer.fit_transform(corpus)
tfidf_matrix = X.toarray()
features = thisvectorizer.get_feature_names_out()

for array in tfidf_matrix:
  tfidf_per_doc = list(zip(features, array))
  tfidf_per_doc.sort(key=lambda x: x[1], reverse=True)
  print(tfidf_per_doc[:3])

The result is:

[('the', 0.6893251240111703), ('and', 0.31175508121108203), ('he', 0.24393467757919754)]
[('the', 0.6907757197452503), ('of', 0.4103688069243256), ('and', 0.28727742797362427)]
[('the', 0.7263025975051108), ('of', 0.3656242079748301), ('to', 0.291070574384772)]
[('the', 0.6754696081456901), ('and', 0.31548027033056486), ('to', 0.2688347676067454)]
[('the', 0.6814989142114783), ('of', 0.45275950370682505), ('and', 0.2884682701141856)]
[('the', 0.695577697455948), ('of', 0.35341130124782577), ('and', 0.31967658612871513)]
[('the', 0.6319718467602307), ('and', 0.3252073024670836), ('of', 0.31905971640910474)]
[('the', 0.7201346766200954), ('of', 0.4283480504712354), ('and', 0.2462470090388333)]
[('the', 0.7145625245362096), ('of', 0.3795569321959571), ('and', 0.2911711705971684)]
[('the', 0.6452744438258314), ('to', 0.2965331457609836), ('and', 0.29378534827130653)]
[('the', 0.7507413874270662), ('of', 0.3364825248186412), ('and', 0.25753131787795447)]
[('the', 0.6883038024694869), ('of', 0.41770049303087814), ('and', 0.2675503490244296)]
[('the', 0.6952456562438267), ('of', 0.39285038765440655), ('and', 0.34045082029960866)]
[('the', 0.5816391566950566), ('and', 0.3731049841274644), ('to', 0.2960718382909285)]
[('the', 0.6514884130485116), ('of', 0.29645876610367955), ('to', 0.2766347756651356)]

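Every one of these words is a stopword. In each category, roughly the top 15 words are stopwords.

If I pass the stop_words parameter with nltk's built-in stopword list, the values look more or less reasonable. But that doesn't make sense to me: tf-idf should downweight stopwords by default, shouldn't it? Did I make a silly mistake somewhere?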

my_stop_words = stopwords.words('english')
thisvectorizer = TfidfVectorizer(stop_words=my_stop_words)

[('said', 0.27925480211869536), ('would', 0.18907877226786665), ('man', 0.18520023334955144)]
[('one', 0.2904582969159082), ('would', 0.1989714323107254), ('new', 0.1394799739062623)]
[('would', 0.2225121466087311), ('one', 0.21533433542780428), ('new', 0.1603044497073654)]
[('would', 0.3015860042740072), ('said', 0.20105733618267146), ('one', 0.19691182409643082)]
[('state', 0.20994145654158766), ('year', 0.16516637619246616), ('fiscal', 0.1627693480477495)]
[('one', 0.27315617167196987), ('new', 0.1339515841852929), ('time', 0.12957408143413954)]
[('said', 0.25253824925464713), ('barco', 0.2297681382507305), ('one', 0.22671047376269457)]
[('af', 0.53260466412674), ('one', 0.2029977500545255), ('may', 0.12401317094240104)]
[('one', 0.29617565661385375), ('time', 0.15556701155475144), ('would', 0.14135656338388475)]
[('said', 0.22644107030344426), ('would', 0.2097909916046616), ('one', 0.1986909391388065)]
[('said', 0.2724277852935244), ('mrs', 0.19471476451838934), ('would', 0.1650670817295739)]
[('god', 0.2540052570261857), ('one', 0.18304020379411245), ('church', 0.17784155752544287)]
[('one', 0.2402151822472666), ('mr', 0.1854602509997279), ('new', 0.16073221753309752)]
[('said', 0.32053197885047946), ('would', 0.23918851593978377), ('could', 0.18980141345828996)]
[('helva', 0.34147320176374735), ('ekstrohm', 0.27116989551827), ('would', 0.2609130084842849)]

【Comments】:

What happens if you use my_stop_words = list(stopwords.words('english'))?

Tags: python scikit-learn nltk tf-idf tfidfvectorizer

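【Answer 1】:

Stopwords are assigned large values because of a problem with how your corpus is built and how tf-idf is computed on it.

The matrix X has shape (15, 42396), which means you have only 15 documents, containing 42396 distinct words.

The mistake is in the following snippet: you join all the words of a given category into a single document instead of using each of the documents the corpus defines: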

for c in brown.categories():
  doc = ' '.join(brown.words(categories=c))
  corpus.append(doc)

You can change the code to:

for c in brown.categories():
    doc = [" ".join(x) for x in brown.sents(categories=c)]
    corpus.extend(doc)

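This creates one entry per sentence, so each sentence is treated as a document. Your X matrix will therefore have shape (57340, 42396).

This matters a great deal: stopwords now appear in most of the documents, which gives them a very low tf-idf value.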
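To make the document-frequency argument concrete, here is a minimal pure-Python sketch on a made-up six-sentence corpus. The sentences and the helper doc_freq_fraction are illustrative assumptions, not part of the Brown corpus or of sklearn. It measures in what fraction of the documents a word occurs, first with one document per sentence and then with all sentences of a group joined into one document.

```python
# Toy corpus: two "categories", three sentences each (made-up data).
categories = {
    "news": ["the state raised the fiscal budget",
             "the mayor said the vote was close",
             "a new tax year began"],
    "fiction": ["the man locked the door",
                "she asked the customer to wait",
                "nobody said a word"],
}

def doc_freq_fraction(docs, word):
    """Fraction of documents that contain `word` at least once."""
    hits = sum(1 for d in docs if word in d.split())
    return hits / len(docs)

# One document per sentence (analogous to corpus.extend(...) above).
sentence_docs = [s for sents in categories.values() for s in sents]
# One document per category (analogous to ' '.join(...) per category).
category_docs = [" ".join(sents) for sents in categories.values()]

for word in ("the", "said", "fiscal"):
    print(word,
          round(doc_freq_fraction(sentence_docs, word), 2),
          doc_freq_fraction(category_docs, word))
```

In this toy setup, "the" and "said" both occur in every category-level document, so document frequency cannot tell them apart there. At sentence level "the" occurs in a larger share of documents than "said", and only then does idf penalize it more.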

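You can use the following snippet to look at the highest-scoring words (note that the slice [:-(25):-1] returns 24 entries, not 25):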

import numpy as np
feature_names = thisvectorizer.get_feature_names_out()
sorted_nzs = np.argsort(X.data)[:-(25):-1]
feature_names[X.indices[sorted_nzs]]

Output:

 array(['customer', 'asked', 'properties', 'itch', 'locked', 'achieving',
        'jack', 'guess', 'criticality', 'me', 'sir', 'beckworth', 'visa',
        'will', 'casey', 'athletics', 'norms', 'yeah', 'eh', 'oh', 'af',
        'currency', 'example', 'movies'], dtype=object)

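【Comments】:

Thanks! Hmm, but there were originally 15 documents, and stopwords such as "the" are certainly in all 15 of them. Why do they get high values?
I actually keep only 15 documents in the corpus on purpose: I want to compare the most important words of each category in the Brown corpus.
The shape of my matrix is (2351, 36092), but I still have this problem. The highest scores are assigned to stopwords.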

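【Answer 2】:

"The corpus is large enough...". Actually, in this case it is only the size of each document/text in the corpus that is large. The corpus itself contains just 15 documents (so N in idf is 15). If you print brown.categories(), you will see that the Brown corpus contains 15 categories, which you are using as your documents. With such a small corpus, many terms (stopwords included) have the same distribution across the documents and are therefore penalized equally by idf. For example, if the word "customer" appears in as many documents as "and", their idf values are identical. Stopwords such as "and", however, usually have a much larger term frequency tf, so they end up with a higher tf-idf score than a word like "customer", which may also appear in every document but with a lower term frequency.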
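The effect described above can be reproduced with a small from-scratch sketch. It assumes the default settings documented for sklearn (smoothed idf, raw term counts, L2 normalization) but does not call sklearn; the three documents and the helper names smoothed_idf and tfidf are made up for illustration. Every word occurs in all three documents, so all idf values are equal and the ranking is decided by term frequency alone.

```python
import math
from collections import Counter

# Made-up documents: "the", "and", "customer", "clerk" each occur in all three.
docs = [
    "the customer and the clerk",
    "the clerk and the customer and the clerk",
    "the customer and the customer and the clerk",
]

def smoothed_idf(n_docs, df):
    """idf(t) = ln((1 + n) / (1 + df(t))) + 1, the smoothed form quoted in this thread."""
    return math.log((1 + n_docs) / (1 + df)) + 1

def tfidf(doc, all_docs):
    """L2-normalized tf-idf scores for one document."""
    counts = Counter(doc.split())
    n = len(all_docs)
    raw = {}
    for term, tf in counts.items():
        df = sum(1 for d in all_docs if term in d.split())
        raw[term] = tf * smoothed_idf(n, df)
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {term: v / norm for term, v in raw.items()}

scores = tfidf(docs[0], docs)
ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)  # "the" ranks first; "and" and "customer" tie
```

Because "and" and "customer" have the same count in the first document, they receive the same score here; in a real corpus the stopword's much larger count would push it ahead.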

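However, the number of documents in the corpus is only part of the problem, since tf-idf is in fact well known for downweighting such frequent terms while highlighting terms that are frequent in one document and rare in all the others. A second possible cause is how sklearn's TfidfVectorizer (and hence TfidfTransformer) computes tf-idf scores. According to the documentation, by default idf is computed as idf(t) = log [ (1 + n) / (1 + df(t)) ] + 1 (followed by cosine normalization), which differs from the standard formula idf(t) = log [ n / df(t) ]. In short, you should use a sufficiently large sample of documents when working with tf-idf. It may also be worth computing tf-idf with the standard formula to see how it behaves. I recently posted an extended answer to a very similar question that explains this and shows that, as the size of the corpus (i.e., the number of documents) grows, more stopwords (or words common across the corpus) are eliminated. Please see here.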
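The two idf formulas quoted above can be compared numerically. The sketch below uses only the standard library; the function names are made up for illustration, and n = 15 matches the question's one-document-per-category setup.

```python
import math

def idf_sklearn_default(n, df):
    """Smoothed idf as quoted from the sklearn documentation."""
    return math.log((1 + n) / (1 + df)) + 1

def idf_textbook(n, df):
    """Standard idf: log(n / df)."""
    return math.log(n / df)

n = 15  # one document per Brown category, as in the question
for df in (15, 8, 1):
    print(df,
          round(idf_sklearn_default(n, df), 3),
          round(idf_textbook(n, df), 3))
```

For a term that occurs in all 15 documents, the smoothed formula yields exactly 1 while the standard formula yields 0. Under the smoothed formula such a term therefore keeps its full term frequency as its weight, and the rarest possible term (df = 1) gets an idf of only about 3.08, which is too small a spread to offset the far larger counts of words like "the".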

【Comments】:

The shape of my matrix is (2351, 36092), but I still have this problem. The highest scores are assigned to stopwords.