Creating a set of stopwords in NLTK (Python): answers

【Question title】: creating set of stopwords in nltk python
【Posted】: 2020-04-14 13:02:02
【Question】:

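I know NLTK ships stopword lists for many languages, but is it possible to create my own set of stopwords and use it the same way I would use the NLTK stopwords?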

import nltk
from nltk.corpus import stopwords
stops=set(stopwords.words('My own set'))
words=["Don't", 'hesitate','to','ask','questions']
print([word for word in words if word not in stops])

【Comments】:

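Define it as stops = ("your","stop","words") and use it in your code.
I think keeping it as an array would make the program very slow, especially for NLP and large datasets. Is there a way to make it a set?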
stops is a set.
Is there a way to import it from a txt or csv file?

Tags: python nlp nltk stop-words

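【Solution 1】:

Store your stopwords in a text file such as stop.txt, separated by whitespace, then read the file and split it:

stop_words = open('stop.txt','r').read().split()

This returns a list containing the stopwords. Wrap it in set(...) if you want fast membership tests.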
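A minimal sketch of the file-based approach, assuming a whitespace-separated stop.txt. The helper name `load_stopwords` and the file contents are hypothetical, chosen only for illustration; the file is written to a temporary directory so the example runs on its own. It loads the words into a set rather than a list, which addresses the speed concern raised in the comments.

```python
import tempfile
from pathlib import Path


def load_stopwords(path):
    """Read whitespace-separated stopwords from a text file into a set.

    Hypothetical helper, not part of NLTK.
    """
    with open(path, "r", encoding="utf-8") as fh:
        # Lowercase every entry so lookups can be case-insensitive.
        return {word.lower() for word in fh.read().split()}


# Demo only: create a small stop.txt with made-up contents.
with tempfile.TemporaryDirectory() as tmp:
    stop_file = Path(tmp) / "stop.txt"
    stop_file.write_text("to the a an", encoding="utf-8")
    stops = load_stopwords(stop_file)

words = ["Don't", "hesitate", "to", "ask", "questions"]
# Set membership is O(1) on average, unlike scanning a list or tuple.
filtered = [word for word in words if word.lower() not in stops]
print(filtered)  # ["Don't", 'hesitate', 'ask', 'questions']
```

If you also want NLTK's built-in words, the two sets can be merged with `stops | set(stopwords.words('english'))`, provided the stopwords corpus has been downloaded.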

【Discussion】:

【Solution 2】:

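Another, possibly cheaper, approach is to create a FILENAME.py file that contains the stopwords as a list, then import FILENAME and use that list. This avoids opening and parsing a separate text file in your own code; Python still loads the module from disk the first time it is imported.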

【Discussion】:

For example, suppose you have a StopWordsFile.py that contains the list stopwordslist = ['they', 'she', 'he', 'we']. You can then import it in another file, i.e. from StopWordsFile import stopwordslist
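A rough sketch of the module-based approach, using the StopWordsFile.py and stopwordslist names from the comment above. To keep the example runnable as a single script, it writes the module into a temporary directory and adds that directory to sys.path; that setup is an assumption made for the demo. In a real project the file would simply sit next to your script and only the import line would be needed.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Demo setup only: create StopWordsFile.py on the fly so the import works.
module_dir = tempfile.mkdtemp()
Path(module_dir, "StopWordsFile.py").write_text(
    "stopwordslist = ['they', 'she', 'he', 'we']\n", encoding="utf-8"
)
sys.path.insert(0, module_dir)
importlib.invalidate_caches()  # make sure the new file is visible to the importer

from StopWordsFile import stopwordslist

# Convert the list to a set once for fast membership tests.
stops = set(stopwordslist)

words = ["they", "ask", "questions"]
filtered = [word for word in words if word not in stops]
print(filtered)  # ['ask', 'questions']
```

Defining the words directly as a set literal inside the module would skip the conversion step; the list form is kept here to match the comment.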