Lemmatization on CountVectorizer doesn't remove stopwords

【Question title】: Lemmatization on CountVectorizer doesn't remove Stopwords
【Posted】: 2018-10-13 18:37:47
【Question】:

I'm trying to add lemmatization to CountVectorizer from scikit-learn, as follows:

import nltk
from pattern.es import lemma
from nltk import word_tokenize
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem import WordNetLemmatizer

class LemmaTokenizer(object):
    def __call__(self, text):
        return [lemma(t) for t in word_tokenize(text)]

vectorizer = CountVectorizer(stop_words=stopwords.words('spanish'),tokenizer=LemmaTokenizer())

sentence = ["EVOLUCIÓN de los sucesos y la EXPANSIÓN, ellos juegan y yo les dije lo que hago","hola, qué tal vas?"]

vectorizer.fit_transform(sentence)

This is the output:

[u',', u'?', u'car', u'decir', u'der', u'evoluci\xf3n', u'expansi\xf3n', u'hacer', u'holar', u'ir', u'jugar', u'lar', u'ler', u'sucesos', u'tal', u'yar']

Update

These are the stopwords that show up, already lemmatized:

u'lar', u'ler', u'der'

It lemmatizes all the words but doesn't remove the stopwords. So, any ideas?

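【Comments】:

You didn't specify LemmaTokenizer in CountVectorizer here. Also, I don't get the same output as you with this code.
Sorry, my mistake. But if you copy the code, it doesn't work. It just doesn't remove the stopwords.
Again, I tried the new code, but I couldn't find any stopword in the output, that is, a word present both in stopwords.words('spanish') and in the output. Can you point out which stopword in the output was not removed?
Thanks. Updated.

Tags: scikit-learn nltk stop-words lemmatization countvectorizer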

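【Solution 1】:

That's because lemmatization is done before stopword removal. By the time the stopwords are filtered, they have already been lemmatized (for example into 'lar', 'ler', 'der'), and those lemmatized forms are not found in the stopword set provided by stopwords.words('spanish').

For the complete working order of CountVectorizer, please refer to my other answer here. It's about TfidfVectorizer, but the order is the same. In that answer, step 3 is lemmatization and step 4 is stopword removal.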
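To make that ordering concrete, here is a minimal pure-Python sketch of it. It is not scikit-learn's actual code: analyze, toy_lemma, TOY_LEMMAS and TOY_STOP_WORDS are hypothetical stand-ins, and the lemma table is made up so that it produces the same kind of forms seen in the question.

```python
import re

# Hypothetical stand-ins: NOT the real pattern.es lemma() or the NLTK Spanish stop list.
TOY_LEMMAS = {"de": "der", "la": "lar", "les": "ler", "juegan": "jugar"}
TOY_STOP_WORDS = {"de", "la", "les", "y"}


def toy_lemma(token):
    """Return the made-up lemma of a token, or the token itself if unknown."""
    return TOY_LEMMAS.get(token, token)


def analyze(text, stop_words):
    """Mimic the order of steps: preprocess, tokenize (with lemmas), filter."""
    text = text.lower()                                          # preprocessing
    tokens = [toy_lemma(t) for t in re.findall(r"\w+", text)]    # custom tokenizer
    return [t for t in tokens if t not in stop_words]            # stopword removal


leaked = analyze("Ellos juegan y les dije la hora de ir", TOY_STOP_WORDS)
print(leaked)
```

In this sketch 'y' is removed because its lemma is unchanged, while 'ler', 'lar' and 'der' survive because the filter only knows the un-lemmatized forms.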

So now, to remove the stopwords, you have two options:

1) Lemmatize the stopword set itself, and then pass it to the stop_words parameter of CountVectorizer.

my_stop_words = [lemma(t) for t in stopwords.words('spanish')]
vectorizer = CountVectorizer(stop_words=my_stop_words, 
                             tokenizer=LemmaTokenizer())

2) Include the stopword removal in the LemmaTokenizer itself.

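class LemmaTokenizer(object):
    def __init__(self):
        # Build the stopword set once instead of reloading the list for every token
        self.stop_words = set(stopwords.words('spanish'))
    def __call__(self, text):
        return [lemma(t) for t in word_tokenize(text) if t not in self.stop_words]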
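The two options can be compared side by side in a small pure-Python sketch. As an assumption for illustration only, toy_lemma, TOY_LEMMAS and TOY_STOP_WORDS stand in for pattern.es lemma() and the NLTK Spanish stop list, and a simple regex replaces word_tokenize.

```python
import re

# Hypothetical stand-ins: NOT the real pattern.es lemma() or the NLTK Spanish stop list.
TOY_LEMMAS = {"de": "der", "la": "lar", "les": "ler", "juegan": "jugar"}
TOY_STOP_WORDS = {"de", "la", "les", "y"}


def toy_lemma(token):
    """Return the made-up lemma of a token, or the token itself if unknown."""
    return TOY_LEMMAS.get(token, token)


def tokenize(text):
    """Lowercase and split into word tokens (a simplification of word_tokenize)."""
    return re.findall(r"\w+", text.lower())


def option_1(text):
    # Option 1: lemmatize the stop list too, so it matches the lemmatized tokens.
    lemmatized_stops = {toy_lemma(w) for w in TOY_STOP_WORDS}
    lemmas = [toy_lemma(t) for t in tokenize(text)]
    return [t for t in lemmas if t not in lemmatized_stops]


def option_2(text):
    # Option 2: drop stopwords first, then lemmatize what is left.
    return [toy_lemma(t) for t in tokenize(text) if t not in TOY_STOP_WORDS]


text = "Ellos juegan y les dije la hora de ir"
result_1 = option_1(text)
result_2 = option_2(text)
print(result_1)
print(result_2)
```

Both give the same result on this toy input. With a real lemmatizer they can differ: option 1 also drops any ordinary word whose lemma happens to coincide with a lemmatized stopword, whereas option 2 only ever drops the original stopword forms.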

Try these, and leave a comment if they don't work.

【Discussion】:

Thanks. I tried this: tokenizer=lambda text: [lemma(t) for t in word_tokenize(text) if (t not in stopwords.words('spanish')) and (t not in punctuation)] and it works. What do you think?