python - How to count how many times a word is repeated in a column for a specific category in Python?

【Question title】: How to count how many times a word is repeated in a column for a specific category in Python?
【Posted】: 2021-10-08 08:47:07
【Question】:

I have been stuck on this problem for several days and would appreciate any help. I have a DataFrame with these columns:

 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----   
0   PhraseId    93636 non-null  int64   
1   SentenceId  93636 non-null  int64   
2   Phrase      93636 non-null  object  
3   Sentiment   93636 non-null  int64

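Sentiment ranges from 0 to 4 and basically rates the phrase from good to bad. I added two columns that might help: the number of words in each phrase, and each phrase split into a list of the words it contains.

What I want to do is create 4 bar charts (one per sentiment), each showing the 15 most repeated words for that sentiment. The x-axis would be the top 15 words repeated in that sentiment.

Below I have pasted the code I wrote, which counts how many times a word is repeated in each sentiment. This is probably what the bar charts need.

Sample data: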

       PhraseId SentenceId  Phrase                Sentiment SplitPhrase  NumOfWords
44723   75358   3866        Build some robots...    0   [Build, some, robots...] 52

Counting how many times a word is repeated for each sentiment:

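from collections import Counter

# One Counter per sentiment value: {sentiment: Counter({word: count})}
counters = {}
for Sentiment in train_data['Sentiment'].unique():
    counters[Sentiment] = Counter()
    # Boolean mask selecting the rows that belong to this sentiment
    indices = (train_data['Sentiment'] == Sentiment)
    for Phrase in train_data['SplitPhrase'][indices]:
        # SplitPhrase holds a list of words, so update() counts each word
        counters[Sentiment].update(Phrase)

print(counters)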

Sample output:

{2: Counter({'the': 28041, ',': 25046, 'a': 19962, 'of': 19376, 'and': 19052, 'to': 13470, '.': 10505, "'s": 10290, 'in': 8108, 'is': 8012, 'that': 7276, 'it': 6176, 'as': 5027, 'with': 4474, 'for': 4362, 'its': 4159, 'film': 3933......}),
 3: Counter({'the': 28041, ',': 25046, 'a': 19962,.....
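
To get from counters of this shape to the bar charts, Counter.most_common(n) returns the n most frequent items in descending order of count. The following is a minimal sketch, not the asker's code: the helper names top_words and plot_top_words and the tiny demo_counters dictionary are made up for illustration, and the plotting function assumes matplotlib is installed.

```python
import math
from collections import Counter


def top_words(counters, n=15):
    """Return {sentiment: [(word, count), ...]} with the n most common words."""
    return {s: c.most_common(n) for s, c in sorted(counters.items())}


def plot_top_words(counters, n=15, ncols=2):
    """Draw one bar chart per sentiment (assumes matplotlib is available)."""
    import matplotlib.pyplot as plt  # lazy import: top_words works without it

    top = top_words(counters, n)
    nrows = math.ceil(len(top) / ncols)  # enough rows for any number of sentiments
    fig, axes = plt.subplots(nrows, ncols, figsize=(6 * ncols, 4 * nrows), squeeze=False)
    flat_axes = axes.ravel()
    for ax, (sentiment, pairs) in zip(flat_axes, top.items()):
        if not pairs:
            continue  # nothing to draw for an empty counter
        words, counts = zip(*pairs)
        ax.bar(words, counts)
        ax.set_title(f"Sentiment {sentiment}")
        ax.set_ylabel("Count")
        ax.tick_params(axis="x", labelrotation=45)
    for ax in flat_axes[len(top):]:
        ax.set_visible(False)  # hide unused panels in the grid
    fig.tight_layout()
    plt.show()


# Made-up miniature data with the same shape as `counters` in the question.
demo_counters = {
    0: Counter({"bad": 3, "film": 2, "the": 5}),
    1: Counter({"good": 4, "film": 1, "the": 6}),
}
demo_top = top_words(demo_counters, n=2)
print(demo_top)
```

With the real data, plot_top_words(counters) would be the call. Since tokens such as 'the' and ',' lead the sample output, filtering punctuation and stop words before counting is probably needed for the charts to differ between sentiments.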

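【Question comments】:

Your explanation makes sense; however, please include sample data, not just the output of df.info(). See this link on how to ask a good pandas question: stackoverflow.com/questions/20109391/…
OK, thanks, I have attached a picture of the sample data.
No pictures! Please read the link I shared :)
I edited again; hopefully this is better. I also changed my question slightly, because I found a way to count how many times a word is repeated for each sentiment. I now need to create bar charts from that.

Tags: python pandas dataframe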

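【Solution 1】:

You can use pandas groupby to split the data into one DataFrame per sentiment. Then, for each group, join the text of the Phrase column, split it into words, and count the occurrences of each word by combining NumPy's unique with the list count method. Sort the resulting list of (word, count) pairs by count (lambda i: i[1]) and slice it to keep the top 15 words. For the bar chart, pass the lists of words and counts to Matplotlib's plt.bar.

Sample from dataframe.csv: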

                                               Phrase  PhraseId  SentenceId  Sentiment
0   Live as if you were to die tomorrow. Learn as ...     15795        2568          3
1       That which does not kill us makes us stronger       860       62592          3
2   Be who you are and say what you feel, because ...     76820       67563          0
..                                                ...       ...         ...        ...
97  Others can stop you temporarily – you are the ...     61228       73530          2
98  Life has no limitations, except the ones you make     48984       93557          3
99    Peace comes from within. Do not seek it without     40774       61087          3
[100 rows x 4 columns]

import pandas as pd
import re
import numpy as np
import matplotlib.pyplot as plt

df = pd.read_csv('dataframe.csv')
print(df)

def remove_stop_words(txt):
    txt = re.sub(r'[,.;!?-]', '', txt.lower())
    stop_words = ['the', 'at', 'in', 'of', 'a', 'is', 'to', 'by']
    stop_boundary = r'\b'+r'\b|\b'.join(stop_words)+r'\b'
    return re.sub(stop_boundary, '', txt)

MAX_WORDS = 15
SENTIMENT = ['Bad', 'Poor', 'Good', 'Excellent']

for n, g in df.groupby('Sentiment'):
    all_text = ' '.join(g['Phrase'].values)

    # optionally, clean txt and remove stop words
    clean_text = remove_stop_words(all_text)

    # find most frequent words
    split_txt = clean_text.split()
    word_count = [(word, split_txt.count(word)) for word in np.unique(split_txt)]
    word_count = sorted(word_count, key=lambda i: i[1], reverse=True)[:MAX_WORDS]
    x, y = zip(*word_count)

    # plot graph
    plt.subplot(2,2,n+1)
    plt.bar(x,y)
    plt.title(SENTIMENT[n])
    plt.ylabel('Count')
    plt.xticks(rotation=45)

plt.show()
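
One possible refinement, offered as a sketch rather than as part of the answer: split_txt.count(word) rescans the whole word list once per unique word, whereas collections.Counter counts every word in a single pass. The version below swaps in Counter and filters stop words after splitting instead of with a regex, so edge cases may differ slightly from remove_stop_words. The names tokenize and top_words_by_group and the sample rows are hypothetical.

```python
import re
from collections import Counter, defaultdict

STOP_WORDS = {"the", "at", "in", "of", "a", "is", "to", "by"}


def tokenize(text):
    """Lower-case, strip the punctuation [,.;!?-], split, and drop stop words."""
    cleaned = re.sub(r"[,.;!?-]", "", text.lower())
    return [w for w in cleaned.split() if w not in STOP_WORDS]


def top_words_by_group(rows, max_words=15):
    """rows: iterable of (sentiment, phrase) pairs.

    Returns {sentiment: [(word, count), ...]} sorted by descending count.
    """
    counters = defaultdict(Counter)
    for sentiment, phrase in rows:
        counters[sentiment].update(tokenize(phrase))
    return {s: c.most_common(max_words) for s, c in sorted(counters.items())}


# Illustrative rows only; not taken from the real dataframe.csv.
rows = [
    (3, "That which does not kill us makes us stronger"),
    (3, "Peace comes from within. Do not seek it without"),
    (0, "The end of the road is the end."),
]
result = top_words_by_group(rows, max_words=2)
print(result)
```

With the DataFrame from the answer, the rows could be supplied as zip(df['Sentiment'], df['Phrase']), and each returned list unpacked with zip(*pairs) for plt.bar as in the loop above.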

【Comments】: