Understanding text feature extraction with TfidfVectorizer in Python scikit-learn: answers

【Question title】: Understanding Text feature extraction TfidfVectorizer in python scikit-learn
【Posted】: 2018-05-13 10:27:15
【Question】:

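Reading the scikit-learn documentation on text feature extraction, I am not sure how the different arguments available for TfidfVectorizer (and possibly the other vectorizers) affect the outcome.

Here are the arguments whose behavior I am unsure about: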

TfidfVectorizer(stop_words='english',  ngram_range=(1, 2), max_df=0.5, min_df=20, use_idf=True)

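The documentation is clear on the use of stop_words and max_df (both have a similar effect, and one could perhaps be used instead of the other). However, I am not sure whether these options should be used together with n-grams. Which one is handled first, n-grams or stop_words, and why? In my experiments the stop words are removed first, but the purpose of n-grams is to extract phrases and the like, so I am not sure about the effect of this sequence (stop words removed, then n-grams built).

Second, does it make sense to use the max_df/min_df arguments together with the use_idf argument? Don't they serve a similar purpose?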

【Comments】:

Tags: python scikit-learn

【Answer 1】:

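I see several questions in this post.

How do the different arguments in TfidfVectorizer interact with one another?

You really have to use it quite a bit to develop a sense of intuition (that has been my experience anyway).

TfidfVectorizer is a bag-of-words approach. In NLP, sequences of words and their windows matter; a bag of words destroys some of that context.

How do I control which tokens get output?

Set ngram_range to (1, 1) to output only one-word tokens, (1, 2) for one-word and two-word tokens, (2, 3) for two-word and three-word tokens, and so on.

ngram_range works hand in hand with analyzer. Set analyzer to "word" to output words and phrases, or to "char" to output character n-grams.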
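To make the effect of ngram_range and analyzer concrete, here is a minimal sketch on a single toy sentence. The helper name `features` is made up for illustration, and CountVectorizer is used only because it accepts the same two arguments as TfidfVectorizer.

```python
from sklearn.feature_extraction.text import CountVectorizer

text = ["what is tfidf"]

def features(**kwargs):
    """Fit a vectorizer on the toy text and return its sorted vocabulary."""
    vec = CountVectorizer(**kwargs)
    vec.fit(text)
    return sorted(vec.vocabulary_)

unigrams = features(ngram_range=(1, 1))                    # one-word tokens only
uni_bi = features(ngram_range=(1, 2))                      # one- and two-word tokens
char_tri = features(analyzer="char", ngram_range=(3, 3))   # character trigrams

print(unigrams)   # ['is', 'tfidf', 'what']
print(uni_bi)     # ['is', 'is tfidf', 'tfidf', 'what', 'what is']
print(char_tri)
```

With analyzer="char" the n-grams run across spaces, so features such as "at " and " is" appear alongside "wha" and "idf".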

If you want your output to have both "word" and "char" features, use sklearn's FeatureUnion. Example here.
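Since the linked example is not reproduced on this page, the following is only a rough sketch of how such a union might look. The step names "word" and "char", the n-gram ranges and the two toy documents are assumptions chosen for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion

docs = ["what is tfidf", "what does tfidf stand for"]

# Two vectorizers fitted on the same text; their columns are stacked side by side.
union = FeatureUnion([
    ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
    ("char", TfidfVectorizer(analyzer="char", ngram_range=(2, 3))),
])
combined = union.fit_transform(docs)

fitted = dict(union.transformer_list)
n_word = len(fitted["word"].vocabulary_)
n_char = len(fitted["char"].vocabulary_)

# The combined matrix has one column per word feature plus one per char feature.
print(combined.shape, n_word, n_char)
```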

How do I remove things I don't want?

Use stop_words to remove less meaningful English words.

The stop word list that sklearn uses can be found at:

# the sklearn.feature_extraction.stop_words module was removed in scikit-learn 0.24
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

The logic behind removing stop words is that these words carry little meaning and appear a lot in most text:

[('the', 79808),
 ('of', 40024),
 ('and', 38311),
 ('to', 28765),
 ('in', 22020),
 ('a', 21124),
 ('that', 12512),
 ('he', 12401),
 ('was', 11410),
 ('it', 10681),
 ('his', 10034),
 ('is', 9773),
 ('with', 9739),
 ('as', 8064),
 ('i', 7679),
 ('had', 7383),
 ('for', 6938),
 ('at', 6789),
 ('by', 6735),
 ('on', 6639)]
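A frequency list like this one can be produced in a few lines. The corpus behind the numbers above is not given, so the sketch below uses a made-up three-sentence corpus and a simple `\w+` tokenizer as assumptions.

```python
import re
from collections import Counter

# Hypothetical toy corpus; the corpus behind the list above is unknown.
corpus = [
    "the cat sat on the mat",
    "the dog and the cat",
    "a dog sat",
]

# Lowercase each document, split it into word tokens, and count them all.
counts = Counter(
    token
    for doc in corpus
    for token in re.findall(r"\w+", doc.lower())
)

print(counts.most_common(1))  # [('the', 4)]
```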

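Since stop words generally have a high frequency, it might make sense to use max_df as a float, say 0.95, which drops every term that appears in more than 95% of the documents (a document-frequency cutoff, not literally the top 5% of terms). But then you are assuming that everything above the cutoff is a stop word, which might not be the case. It really depends on your text data. In my line of work it is very common for the top words or phrases NOT to be stop words, because I work with dense text (search query data) on very specific topics.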
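A small sketch of how a float max_df behaves, using four made-up documents and a cutoff of 0.75 (both are assumptions for illustration). A term is dropped only when its document frequency is strictly above the cutoff.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "tfidf weights rare words",
    "tfidf weights common words",
    "tfidf ignores word order",
    "tfidf is a bag of words model",
]

# "tfidf" is in 4 of 4 documents (1.0 > 0.75), so it is dropped.
# "words" is in 3 of 4 documents (0.75, not above the cutoff), so it stays.
vec = CountVectorizer(max_df=0.75)
vec.fit(docs)
kept = sorted(vec.vocabulary_)

print(kept)
```

Note that the most frequent term here, "tfidf", is not a stop word at all, which is the situation described above.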

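Use min_df as an integer to remove rarely occurring words. If they only occur once or twice, they will not add much value and are usually really obscure. They are also usually numerous, so ignoring them with, say, min_df=5 can greatly reduce memory consumption and data size.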
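A minimal sketch of an integer min_df, assuming four made-up search-query-like documents and a threshold of 2 so the effect is visible on such a small sample.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "cheap flights to madrid",
    "cheap flights to rome",
    "cheap hotels in rome",
    "flights and hotels",
]

vec_all = CountVectorizer().fit(docs)
vec_min2 = CountVectorizer(min_df=2).fit(docs)  # keep terms found in >= 2 documents

print(len(vec_all.vocabulary_))    # 8
print(sorted(vec_min2.vocabulary_))  # ['cheap', 'flights', 'hotels', 'rome', 'to']
```

The terms "madrid", "in" and "and" each occur in a single document and are dropped, which shrinks the vocabulary from 8 to 5 columns.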

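How do I include things that are being stripped out?

token_pattern uses the regular expression \b\w\w+\b, which means a token must be at least 2 characters long, so words such as "I" and "a" are removed, as are single-digit numbers (0 to 9). You will also notice that it strips apostrophes.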
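The pattern can be tried on its own with the standard `re` module, which makes it easy to see what is lost. The sample sentence is made up for illustration.

```python
import re

# The pattern quoted above: a word boundary, then at least two word characters.
pattern = r"\b\w\w+\b"

text = "why don't i use tfidf 1 in 10 times"
tokens = re.findall(pattern, text)

# "i" and "1" are too short, and "don't" is cut at the apostrophe,
# leaving "don" (the trailing "t" is a single character and is dropped).
print(tokens)  # ['why', 'don', 'use', 'tfidf', 'in', '10', 'times']
```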

What happens first, n-gram generation or stop word removal?

Let's do a little test.

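import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
# ENGLISH_STOP_WORDS is importable from sklearn.feature_extraction.text;
# the old sklearn.feature_extraction.stop_words module was removed in 0.24
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

docs = np.array(['what is tfidf',
        'what does tfidf stand for',
        'what is tfidf and what does it stand for',
        'tfidf is what',
        "why don't I use tfidf",
        '1 in 10 people use tfidf'])

# use_idf=False and norm=None leave plain term counts in the matrix
tfidf = TfidfVectorizer(use_idf=False, norm=None, ngram_range=(1, 1))
matrix = tfidf.fit_transform(docs).toarray()

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
df = pd.DataFrame(matrix, index=docs, columns=tfidf.get_feature_names_out())

# For comparison: naive stop word removal on whitespace-split words
for doc in docs:
    print(' '.join(word for word in doc.split() if word not in ENGLISH_STOP_WORDS))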

This prints out:

tfidf
does tfidf stand
tfidf does stand
tfidf
don't I use tfidf
1 10 people use tfidf

Now let's print df:

                                           10  and  does  don  for   in   is  \
what is tfidf                             0.0  0.0   0.0  0.0  0.0  0.0  1.0   
what does tfidf stand for                 0.0  0.0   1.0  0.0  1.0  0.0  0.0   
what is tfidf and what does it stand for  0.0  1.0   1.0  0.0  1.0  0.0  1.0   
tfidf is what                             0.0  0.0   0.0  0.0  0.0  0.0  1.0   
why don't I use tfidf                     0.0  0.0   0.0  1.0  0.0  0.0  0.0   
1 in 10 people use tfidf                  1.0  0.0   0.0  0.0  0.0  1.0  0.0   

                                           it  people  stand  tfidf  use  \
what is tfidf                             0.0     0.0    0.0    1.0  0.0   
what does tfidf stand for                 0.0     0.0    1.0    1.0  0.0   
what is tfidf and what does it stand for  1.0     0.0    1.0    1.0  0.0   
tfidf is what                             0.0     0.0    0.0    1.0  0.0   
why don't I use tfidf                     0.0     0.0    0.0    1.0  1.0   
1 in 10 people use tfidf                  0.0     1.0    0.0    1.0  1.0   

                                          what  why  
what is tfidf                              1.0  0.0  
what does tfidf stand for                  1.0  0.0  
what is tfidf and what does it stand for   2.0  0.0  
tfidf is what                              1.0  0.0  
why don't I use tfidf                      0.0  1.0  
1 in 10 people use tfidf                   0.0  0.0

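Notes:

With use_idf=False, norm=None set, this is equivalent to using sklearn's CountVectorizer. It just returns counts.
Notice that the word "don't" was converted to "don". This is where you would change token_pattern to something like token_pattern=r"\b\w[\w']+\b" to include apostrophes.
We see a lot of stop words.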
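The first two notes can be checked directly. This is a small sketch on three of the documents from the test above; the variable names are chosen here for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "what is tfidf",
    "why don't I use tfidf",
    "what is tfidf and what does it stand for",
]

# Note 1: without IDF and without normalization, the output is plain counts.
counts = CountVectorizer().fit_transform(docs).toarray()
raw_tf = TfidfVectorizer(use_idf=False, norm=None).fit_transform(docs).toarray()
same = np.array_equal(counts, raw_tf)
print(same)  # True

# Note 2: a token_pattern that allows apostrophes keeps "don't" in one piece.
apostrophe_vec = CountVectorizer(token_pattern=r"\b\w[\w']+\b").fit(docs)
print("don't" in apostrophe_vec.vocabulary_)  # True
```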

Let's remove the stop words and look at df again:

tfidf = TfidfVectorizer(use_idf=False, norm=None, stop_words='english', ngram_range=(1, 2))

Output:

                                           10  10 people  does  does stand  \
what is tfidf                             0.0        0.0   0.0         0.0   
what does tfidf stand for                 0.0        0.0   1.0         0.0   
what is tfidf and what does it stand for  0.0        0.0   1.0         1.0   
tfidf is what                             0.0        0.0   0.0         0.0   
why don't I use tfidf                     0.0        0.0   0.0         0.0   
1 in 10 people use tfidf                  1.0        1.0   0.0         0.0   

                                          does tfidf  don  don use  people  \
what is tfidf                                    0.0  0.0      0.0     0.0   
what does tfidf stand for                        1.0  0.0      0.0     0.0   
what is tfidf and what does it stand for         0.0  0.0      0.0     0.0   
tfidf is what                                    0.0  0.0      0.0     0.0   
why don't I use tfidf                            0.0  1.0      1.0     0.0   
1 in 10 people use tfidf                         0.0  0.0      0.0     1.0   

                                          people use  stand  tfidf  \
what is tfidf                                    0.0    0.0    1.0   
what does tfidf stand for                        0.0    1.0    1.0   
what is tfidf and what does it stand for         0.0    1.0    1.0   
tfidf is what                                    0.0    0.0    1.0   
why don't I use tfidf                            0.0    0.0    1.0   
1 in 10 people use tfidf                         1.0    0.0    1.0   

                                          tfidf does  tfidf stand  use  \
what is tfidf                                    0.0          0.0  0.0   
what does tfidf stand for                        0.0          1.0  0.0   
what is tfidf and what does it stand for         1.0          0.0  0.0   
tfidf is what                                    0.0          0.0  0.0   
why don't I use tfidf                            0.0          0.0  1.0   
1 in 10 people use tfidf                         0.0          0.0  1.0   

                                          use tfidf  
what is tfidf                                   0.0  
what does tfidf stand for                       0.0  
what is tfidf and what does it stand for        0.0  
tfidf is what                                   0.0  
why don't I use tfidf                           1.0  
1 in 10 people use tfidf                        1.0

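Takeaways:

The token "don use" appears because "don't I use" had the 't stripped off, and because "I" is shorter than two characters it was removed as well, so the remaining words were joined into "don use"... which is not how the sentence was actually structured and could change its meaning a bit!
Answer: stop words and short tokens are removed first, and the n-grams are generated afterwards, which can return unexpected results.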
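The same ordering can be observed without building a full matrix by calling the vectorizer's own tokenizer and analyzer on one sentence. This is a minimal sketch; the sentence is lowercased by hand because build_tokenizer() does not lowercase on its own.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vec = TfidfVectorizer(stop_words="english", ngram_range=(1, 2))
tokenize = vec.build_tokenizer()  # token_pattern only
analyze = vec.build_analyzer()    # lowercasing, tokenizing, stop words, n-grams

text = "why don't i use tfidf"

tokens = tokenize(text)
print(tokens)   # ['why', 'don', 'use', 'tfidf']

ngrams = analyze(text)
print(ngrams)   # ['don', 'use', 'tfidf', 'don use', 'use tfidf']
```

The stop word "why" is gone before any bigram is formed, which is why no "why don" feature shows up.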

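Does it make sense to use the max_df/min_df arguments together with the use_idf argument?

In my opinion, the whole point of term frequency-inverse document frequency is to allow re-weighting of highly frequent words (the words that would appear at the top of a sorted frequency list). This re-weighting takes the highest-frequency n-grams and moves them down the list to a lower position. Therefore, it is supposed to handle max_df scenarios.

It is perhaps more of a personal choice whether you want to move them down the list ("re-weight" or de-prioritize them) or remove them completely.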
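To see the re-weighting in numbers, the sketch below fits a TfidfVectorizer on four made-up documents and reads the learned IDF weights. With the default smooth_idf=True, scikit-learn computes idf(t) = ln((1 + n) / (1 + df(t))) + 1, so a term found in every document gets the minimum weight of 1.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "tfidf weights words",
    "tfidf ranks documents",
    "tfidf scores words",
    "tfidf rare spain",
]

vec = TfidfVectorizer(norm=None)
vec.fit(docs)

# Map each term to its learned IDF weight.
idf_of = {term: vec.idf_[idx] for term, idx in vec.vocabulary_.items()}

print(idf_of["tfidf"])  # in all 4 documents: ln(5/5) + 1
print(idf_of["words"])  # in 2 documents:     ln(5/3) + 1
print(idf_of["spain"])  # in 1 document:      ln(5/2) + 1
```

The frequent term is down-weighted but still present, whereas max_df would remove its column entirely.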

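I use min_df a lot. It makes sense to use min_df if you are working with a huge dataset, because rare words will not add value and will just cause a lot of processing issues. I don't use max_df much, but I am sure there are scenarios, such as working with all of Wikipedia, where removing the top x% makes sense.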

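【Comments】:

【Answer 2】:

Stop word removal will not affect your n-grams. The vocabulary (token) list is first created from your tokenizer and the n-gram range, and the stop words are then removed from that list (so only unigrams are affected, since the stop word list contains only unigrams). Note that this is not the same as removing stop words in the tokenization step (which people often do); in that case they will not be included in the bigrams either.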
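For what it is worth, the built-in word analyzer removes stop words before it builds n-grams (see the comments further down), so the ordering described here has to be written by hand. A rough sketch using a callable analyzer follows; the function name `ngrams_then_stop_words` and its tokenizing regex are assumptions made for illustration.

```python
import re
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

def ngrams_then_stop_words(doc):
    """Hypothetical analyzer: build bigrams from ALL tokens, then drop
    stop words from the unigrams only."""
    tokens = re.findall(r"\b\w\w+\b", doc.lower())
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    unigrams = [t for t in tokens if t not in ENGLISH_STOP_WORDS]
    return unigrams + bigrams

vec = CountVectorizer(analyzer=ngrams_then_stop_words)
vec.fit(["what is tfidf"])
features = sorted(vec.vocabulary_)

# Stop words survive inside bigrams but not as standalone features.
print(features)  # ['is tfidf', 'tfidf', 'what is']
```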
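Actually, using min_df might counter the effect of tf-idf, since a word that occurs twice in a single document might get a high score (remember that scores are per document). It depends on what your system is for (information retrieval or text classification). If the threshold is low it should not affect text classification much, but retrieval could be biased (what if I want to find documents containing "Spain" and it occurs only once, in one document, in the whole collection?). As you said, use_idf overlaps with max_df, but removing a word from the vocabulary might have more impact than just giving it a low weight. Again, it depends on what you plan to do with the weights.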
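The retrieval concern can be sketched with a tiny made-up collection in which "spain" occurs in exactly one document. The documents, the query and the min_df=2 threshold are all assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "holiday in spain",
    "holiday in italy",
    "holiday in italy again",
    "holiday again",
]
query = ["spain"]

keep_all = TfidfVectorizer().fit(docs)
pruned = TfidfVectorizer(min_df=2).fit(docs)  # "spain" occurs in one document only

# Count the non-zero entries of the query vector under each vocabulary.
hits_keep_all = keep_all.transform(query).nnz
hits_pruned = pruned.transform(query).nnz

print(hits_keep_all)  # 1: the query still matches a feature
print(hits_pruned)    # 0: the query term is no longer in the vocabulary
```

With the pruned vocabulary the query becomes an all-zero vector, so the single relevant document can no longer be found through this representation.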

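Hope this helps.

【Comments】:

Elliot, thank you very much for your reply. Please see Jarad's reply on the ordering of stop_words and n-grams. In my experiments I also see the stop words being removed first. But what you wrote above is interesting and would be ideal: tokens are created based on the tokenizer... Could you give an example?
Yes! That was my mistake. Jarad showed that my idea was actually wrong, and it is good to know that now. In fact, I rarely use the scikit-learn vectorizers to remove stop words. I create my own tokenizer and keep or remove the patterns I am interested in inside that function. What kind of example would you like?
I was wondering whether what you described, creating the n-grams and then removing the stop words, is possible given how the code is currently written. Thinking about it more, it should not make much difference. Thanks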