Why are stop words not being excluded from the word cloud when using Python's wordcloud library?

【Question title】: Why are stop words not being excluded from the word cloud when using Python's wordcloud library?
【Posted】: 2020-09-09 05:18:47
【Question】:

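I want to exclude "The", "They" and "My" from my word cloud. I am using the Python library "wordcloud" as shown below, and I updated its STOPWORDS set with these 3 additional stop words, but the word cloud still includes them. What do I need to change so that these 3 words are excluded?

The libraries I imported are: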

import numpy as np
import pandas as pd
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

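I tried adding elements to the STOPWORDS set as shown below. The words are added successfully, yet the word cloud still shows the 3 words I added to the STOPWORDS set:

len(STOPWORDS) outputs: 192

Then I ran: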

STOPWORDS.add('The')
STOPWORDS.add('They')
STOPWORDS.add('My')

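Then I ran:

len(STOPWORDS) outputs: 195

I am running Python version 3.7.3.

I know I could modify the text input to remove the 3 words before running wordcloud (instead of trying to modify WordCloud's STOPWORDS set), but I would like to know whether there is a bug in WordCloud, or whether I am not updating or using STOPWORDS correctly.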

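【Comments】:

Have you tried adding the stop words all in lowercase?
The lowercase 'the', 'they' and 'my' are already in WordCloud's STOPWORDS set. I added "The", "They" and "My" to it. Even though they are now in the stop word set, the words I added are not excluded from the word cloud.

Tags: python nlp word-cloud stop-words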

【Solution 1】:

pip install nltk

Don't forget to download the stop words.

python
>>> import nltk
>>> nltk.download('stopwords')

Try this:

from wordcloud import WordCloud
from matplotlib import pyplot as plt
from nltk.corpus import stopwords

# Use a separate name so the imported nltk stopwords module is not shadowed.
english_stopwords = set(stopwords.words('english'))

text = ("The bear sat with the cat. They were good friends. "
        "My friend is a bit bear like. He's lovely. The bear, the cat, the dog and me were all sat "
        "there enjoying the view. You should have seen it. The view was absolutely lovely. "
        "It was such a lovely day. The bear was loving it too.")

# Lowercase the text so "The", "They" and "My" match the lowercase stop words.
cloud = WordCloud(stopwords=english_stopwords,
                  background_color='white',
                  max_words=10).generate(text.lower())
plt.imshow(cloud, interpolation='bilinear')
plt.axis('off')
plt.show()

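【Comments】:

Thanks for the suggestion, but it did not work. I added the same 3 words to the NLTK stop word list and the word cloud still shows them.
I lowercased the text and used your input. Take a look now.
I explain in my answer why lowercasing the text does make the stop words disappear from the cloud, but it does not seem like an entirely ideal solution. Also, you don't need the nltk stop words; wordcloud.STOPWORDS already removes the/they/my etc. By the way, in the code for text you missed a space " " after "sat" and "lovely".
Added the spaces after sat and lovely.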

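【Solution 2】:

The default for WordCloud is collocations=True, so frequent phrases of two adjacent words are included in the cloud. Importantly for your issue, stop word removal works differently for collocations: for example, "Thank you" is a valid collocation and may appear in the generated cloud even though "you" is in the default stop words. Only collocations that consist entirely of stop words are removed.

The not unreasonable rationale for this is that if stop words were removed before building the list of collocations, then "Thank you very much" would yield "Thank very" as a collocation, which I definitely would not want.

So, to get your stop words to work the way you probably expect, i.e. no stop words appear in the cloud at all, you can use collocations=False like this: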

my_wordcloud = WordCloud(
    stopwords=my_stopwords,
    background_color='white', 
    collocations=False, 
    max_words=10).generate(all_tweets_as_one_string)
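
To make the "Thank very" reasoning concrete, here is a minimal pure-Python sketch that compares the two orders of operation. It is not wordcloud's own code: the tiny STOP set and the helper names tokens and bigrams are made up for illustration, and the filtering rule simply follows the description above (drop only pairs made up entirely of stop words).

```python
import re

# Hypothetical, tiny stop word set for illustration; wordcloud.STOPWORDS is much larger.
STOP = {"you", "much"}


def tokens(text):
    """Split text into word tokens."""
    return re.findall(r"[A-Za-z']+", text)


def bigrams(words):
    """Return adjacent word pairs joined by a space."""
    return [" ".join(pair) for pair in zip(words, words[1:])]


words = tokens("Thank you very much")

# Option A: drop stop words first, then pair up whatever is left.
filtered_first = bigrams([w for w in words if w.lower() not in STOP])

# Option B: pair first, then drop only pairs made up entirely of stop words.
paired_first = [
    pair for pair in bigrams(words)
    if not all(w.lower() in STOP for w in pair.split())
]

print(filtered_first)  # ['Thank very']
print(paired_first)    # ['Thank you', 'you very', 'very much']
```

Option A invents a phrase that never occurs in the text, which is why stop words are allowed to survive inside collocations.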

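Update:

With collocations=False, the stop words are all lowercased and compared against the lowercased text when they are removed, so there is no need to add "The" etc.
With collocations=True (the default), the stop words are still lowercased, but the source text is not lowercased when looking for all-stop-word collocations to remove, so e.g. The in the text is not removed while the is removed. That is why @Balaji Ambresh's code works, and you will see that there are no capitals in the cloud. This might be a defect in WordCloud, I'm not sure. But adding e.g. The to the stop words will not affect this, because the stop words are always lowercased regardless of whether collocations is True or False.

This is all visible in the source code :-)

For example, with the default collocations=True I get:

With collocations=False I get:

Code: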
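
The case handling described in this update can be modeled with a few lines of plain Python. This is a simplified sketch of the behavior as described here, not the library's source; the function names are hypothetical, and other wordcloud versions may behave differently.

```python
# Stop words end up lowercased, so adding "The", "They" or "My" adds nothing new.
stop_words = {s.lower() for s in {"the", "they", "my", "The", "They", "My"}}


def kept_without_collocations(words):
    """Model of collocations=False: each word is lowercased before the lookup."""
    return [w for w in words if w.lower() not in stop_words]


def kept_with_collocations(words):
    """Model of collocations=True as described above: the word is looked up as written."""
    return [w for w in words if w not in stop_words]


words = "The bear sat with the cat".split()

print(sorted(stop_words))                # ['my', 'the', 'they']
print(kept_without_collocations(words))  # ['bear', 'sat', 'with', 'cat']
print(kept_with_collocations(words))     # ['The', 'bear', 'sat', 'with', 'cat']

# Lowercasing the text first sidesteps the mismatch.
print(kept_with_collocations([w.lower() for w in words]))  # ['bear', 'sat', 'with', 'cat']
```

In this model the capitalized "The" survives only in the case-sensitive lookup, which matches the observation that lowercasing the input text makes it disappear.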

from wordcloud import WordCloud
from matplotlib import pyplot as plt


text = "The bear sat with the cat. They were good friends. " + \
        "My friend is a bit bear like. He's lovely. The bear, the cat, the dog and me were all sat " + \
        "there enjoying the view. You should have seen it. The view was absolutely lovely. " + \
            "It was such a lovely day. The bear was loving it too."

cloud = WordCloud(collocations=False,
        background_color='white',
        max_words=10).generate(text)
plt.imshow(cloud, interpolation='bilinear')
plt.axis('off')
plt.show()

【Comments】:

See also this GitHub issue for version 1.7.0.