Answers: Generate word cloud from single-column Pandas dataframe

【Question title】: Generate word cloud from single-column Pandas dataframe
【Posted】: 2022-05-13 15:15:20
【Question description】:

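I have a Pandas dataframe with a single column: Crime type. The column contains 16 different crime "categories", which I would like to visualize as a word cloud, with each word sized according to its frequency in the dataframe.

I have tried to do this with the following code.

To bring in the data: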

fields = ['Crime type']

text2 = pd.read_csv('allCrime.csv', usecols=fields)

To generate the word cloud:

wordcloud2 = WordCloud().generate(text2)
# Generate plot
plt.imshow(wordcloud2)
plt.axis("off")
plt.show()

However, I get this error:

TypeError: expected string or bytes-like object

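I was able to create an earlier word cloud from the full dataset using the following code, but I want the word cloud to be generated only from the specific column 'Crime type' ('allCrime.csv' contains around 13 columns):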

text = open('allCrime.csv').read()
wordcloud = WordCloud().generate(text)
# Generate plot
plt.imshow(wordcloud)
plt.axis("off")
plt.show()

I am new to Python and Pandas (and to coding in general!), so any help is much appreciated.

【Question comments】:

You might want to check this ...

Tags: python pandas dataframe word-cloud

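【Solution 1】:

The problem is that the WordCloud.generate method you are using expects a string, in which it counts word instances, but you are passing it a pandas object (text2 is the dataframe returned by pd.read_csv).

Depending on what you want the word cloud to be built from, you can do either of the following:

wordcloud2 = WordCloud().generate(' '.join(text2['Crime type'])), which joins all the values in the dataframe column into one string and then counts all word instances.
Use WordCloud.generate_from_frequencies to pass in word frequencies that you have computed yourself.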
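The difference between the two options matters for multi-word categories. The following is a minimal sketch, not the answerer's code: the sample list and the helper name render_from_frequencies are hypothetical, and the whitespace split only approximates how a long string is broken into single words. It shows that counting the joined text counts individual words, while counting the column values keeps each category whole.

```python
from collections import Counter

# Hypothetical sample standing in for text2['Crime type'] (not the question's data)
crime_types = [
    "Burglary",
    "Vehicle crime",
    "Burglary",
    "Violent crime",
    "Vehicle crime",
    "Burglary",
]

# Option 1 (approximation): join everything, then count single words
word_counts = Counter(" ".join(crime_types).split())

# Option 2: count whole column values, so each category stays one entry
category_counts = Counter(crime_types)

print(word_counts)
print(category_counts)


def render_from_frequencies(frequencies):
    """Draw a word cloud from a {label: count} mapping.

    Requires the wordcloud and matplotlib packages; imported here so the
    counting part above runs without them.
    """
    from wordcloud import WordCloud
    import matplotlib.pyplot as plt

    cloud = WordCloud().generate_from_frequencies(dict(frequencies))
    plt.imshow(cloud)
    plt.axis("off")
    plt.show()
```

Calling render_from_frequencies(category_counts) would then size "Vehicle crime" as one label rather than as the separate words "Vehicle" and "crime".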

【Comments】:

Thanks languitar and @MaxU - the combination of your posts worked for me.

【Solution 2】:

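import pandas as pd
import matplotlib.pyplot as plt
from wordcloud import WordCloud

fields = ['Crime type']
df = pd.read_csv('allCrime.csv', usecols=fields)

# Join the column values into one string. Calling str() on the NumPy array
# instead would abbreviate long arrays with "..." and drop most of the data.
text = ' '.join(df['Crime type'].dropna().astype(str))

wordcloud = WordCloud().generate(text)

plt.imshow(wordcloud)
plt.axis("off")
plt.show()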

【Comments】:

【Solution 3】:

You need to create a single concatenated input text. This can be done with the join function.

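import pandas as pd
import matplotlib.pyplot as plt
from wordcloud import WordCloud

fields = ['Crime type']
text2 = pd.read_csv('allCrime.csv', usecols=fields)

# The column name must match the CSV header exactly ('Crime type', lowercase t)
text3 = ' '.join(text2['Crime type'])
wordcloud2 = WordCloud().generate(text3)
# Generate plot
plt.imshow(wordcloud2)
plt.axis("off")
plt.show()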
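One caveat with join: it accepts only strings, so a missing cell in the column (pandas typically represents it as a float NaN) raises a TypeError. The question does not say whether allCrime.csv has such gaps, so this is only a minimal sketch under that assumption; join_text_values and the sample list are hypothetical names used for illustration.

```python
def join_text_values(values):
    """Join only the string entries of a column, skipping missing values such as NaN."""
    return " ".join(v for v in values if isinstance(v, str))


# Hypothetical sample standing in for text2['Crime type']; NaN marks a missing cell
sample = ["Burglary", float("nan"), "Vehicle crime", "Burglary"]
text3 = join_text_values(sample)
print(text3)
```

The helper accepts any iterable, so a dataframe column can be passed in directly in place of the sample list.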

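【Comments】:

【Solution 4】:

You can generate the word cloud for a single column while removing all stop words. Assuming your dataframe is df and the column name is comment, the following code can help: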

# Final word cloud after all the cleaning and pre-processing
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

# Assumes df is an existing dataframe with a 'comment' column
comment_words = ''
stopwords = set(STOPWORDS)

# Iterate through the values of the column
for val in df.comment:

    # Cast each value to a string
    val = str(val)

    # Split the value into tokens
    tokens = val.split()

    # Convert each token to lowercase
    for i in range(len(tokens)):
        tokens[i] = tokens[i].lower()

    # Append this row's tokens to the accumulated text
    # (this must happen inside the loop, otherwise only the last row is used)
    comment_words += ' '.join(tokens) + ' '

wordcloud = WordCloud(width=800, height=800,
                      background_color='white',
                      stopwords=stopwords,
                      min_font_size=10).generate(comment_words)

# Plot the WordCloud image
plt.figure(figsize=(8, 8), facecolor=None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad=0)

plt.show()
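The preprocessing this answer describes (cast each value to a string, split it, lowercase the tokens, accumulate them) can also be sketched as one small helper. This is an illustrative condensation rather than the answerer's code; build_corpus and the sample values are hypothetical names.

```python
def build_corpus(values):
    """Lowercase every value, split it on whitespace, and join all tokens with spaces."""
    tokens = []
    for val in values:
        # str() guards against non-string cells such as numbers
        tokens.extend(str(val).lower().split())
    return " ".join(tokens)


# Hypothetical sample standing in for df.comment
corpus = build_corpus(["Great Service", 42, "great  value"])
print(corpus)
```

The resulting string can be passed to WordCloud(stopwords=...).generate(...) in the same way as comment_words.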

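【Comments】:

How does this code relate to the question above? Did you test it? The code above does not work. Where do df and comments come from? Where is the pandas import? Please correct and test the code before posting.

【Solution 5】:

This can be done easily with the following approach: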

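import pandas as pd
import matplotlib.pyplot as plt
from wordcloud import WordCloud

df = pd.read_csv('allCrime.csv')
# Map each crime type to the number of rows it appears in
data = df['Crime type'].value_counts().to_dict()
wc = WordCloud().generate_from_frequencies(data)

plt.imshow(wc)
plt.axis('off')
plt.show()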

【Comments】:

【Solution 6】:

import re
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Assumes df is an existing dataframe with a 'text' column

# Remove punctuation
df['text_proc'] = df['text'].map(lambda x: re.sub(r'[,.!?]', '', x))

# Convert the text to lowercase
df['text_proc'] = df['text_proc'].map(lambda x: x.lower())

# Print out the first rows of the processed text
print(df['text_proc'].head())

# Join the processed texts together
long_string = ','.join(df['text_proc'])

# Create a WordCloud object
wordcloud = WordCloud(background_color="white", max_words=5000,
                      contour_width=3, contour_color='steelblue')

# Generate a word cloud
wordcloud.generate(long_string)

# Visualize the word cloud
plt.figure(figsize=(20, 10))
plt.imshow(wordcloud)
plt.show()

[Image: Wordcloud example]
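The cleaning step of this answer can be tried in isolation, without a dataframe. A minimal sketch, assuming plain Python strings as input; clean_text and PUNCTUATION are hypothetical names, and the character class is the one used in the answer.

```python
import re

# Same character class as in the answer: comma, period, exclamation mark, question mark
PUNCTUATION = re.compile(r"[,.!?]")


def clean_text(text):
    """Strip the listed punctuation characters and lowercase the result."""
    return PUNCTUATION.sub("", text).lower()


print(clean_text("Theft, Burglary! Robbery?"))
```

Note that the punctuation is deleted rather than replaced with a space, so a string like "No.Stop" becomes the single token "nostop".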

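【Comments】:

Your answer could be improved with additional supporting information. Please edit it to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers in the help center.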