How can I modify the NLTK stop word list in Python? Answers

【Question title】: How can I modify the NLTK stop word list in Python?
【Posted】: 2018-07-21 18:10:54
【Question description】:

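I'm relatively new to the Python/programming community, so please excuse my fairly simple question: I want to filter out stop words before lemmatizing a csv file, but I need the stop words "this"/"these" to remain in the final set of words.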

After importing the NLTK stop words in Python and defining them as

stopwords = set(stopwords.words('english'))

...how can I modify this set so that "this"/"these" are kept?

I know I could manually list every word except these two, but I'm looking for a more elegant solution.

【Question discussion】:

【Solution 1】:

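If you want those words to survive in your final output, just remove them from the default stop word list, so that they are no longer filtered out: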

new_stopwords = set(stopwords.words('english')) - {'this', 'these'}

Alternatively,

to_remove = ['this', 'these']
new_stopwords = set(stopwords.words('english')).difference(to_remove)

set.difference accepts any iterable, so to_remove can be a list, whereas the - operator requires a set on both sides.
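To show how the customized set might then be used for the filtering step described in the question, here is a minimal sketch. The helper names build_stopwords and remove_stopwords, the sample tokens, and the short fallback word list are illustrative assumptions, not part of NLTK; the fallback is only there so the snippet still runs when NLTK or its 'stopwords' corpus is not installed. The sketch also uses distinct variable names, because rebinding the name stopwords to the set (as in the question) hides the imported module.

```python
# Sketch: keep selected words out of the stop word set, then filter tokens.
try:
    from nltk.corpus import stopwords
    base_words = stopwords.words('english')
except (ImportError, LookupError):
    # Fallback sample so the sketch runs without NLTK or its 'stopwords'
    # corpus; this short list is illustrative only, not the real NLTK list.
    base_words = ['a', 'an', 'the', 'is', 'this', 'these', 'of', 'and']


def build_stopwords(base, keep):
    """Return the words in `base` as a set, minus every word in `keep`."""
    return set(base).difference(keep)


def remove_stopwords(tokens, stop_set):
    """Drop tokens whose lowercase form is in `stop_set`."""
    return [t for t in tokens if t.lower() not in stop_set]


# 'this' and 'these' are no longer treated as stop words.
custom_stopwords = build_stopwords(base_words, keep=['this', 'these'])

tokens = ['This', 'is', 'the', 'end', 'of', 'these', 'examples']
filtered = remove_stopwords(tokens, custom_stopwords)
print(filtered)
```

Tokens are lowercased before the lookup because the NLTK English stop words are stored in lowercase; the filtered list can then be passed on to whatever lemmatizer you use.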

【Discussion】: