Counting occurrences of a word in a string in Pandas

【Question title】: Counting occurrences of word in a string in Pandas
【Posted】: 2021-01-09 08:09:15
【Question】:

I am trying to count how many times a word occurs across all the strings in a Pandas Series. I have a dataframe, df, that looks like this:

word

hi    
hello
bye
goodbye

And a second one, df_2, that looks like this:

sentence                                                                            metric_x

hello, what a wonderful day                                                         10
I did not said hello today                                                          15
what comes first, hi or hello                                                       25
the most used word is hi                                                            30
hi or hello, which is more formal                                                   50
he said goodbye, even though he never said hi or hello in the first place           5

I am trying to achieve the following in df: for each word, count how many times it occurs, and sum metric_x over the sentences that match the word.

word        count       metric_x_sum
        
hi          4           110
hello       5           105
bye         0           0
goodbye     1           5

I am using this:

df['count'] = df['word'].apply(lambda x: df_2['sentence'].str.count(x).sum())

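The problem is the size of the dataframes: I have 70,000 unique words in df and 250,000 unique sentences in df_2. The line above had been running for 15 minutes, and I have no idea how long it might take to finish.

After letting it run for 15 minutes, I got this error: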

error: multiple repeat at position 2

Is there a smarter, faster way to achieve this?
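A note on the error itself: Series.str.count treats its argument as a regular expression, so a word that contains regex metacharacters can fail to compile. The sketch below reproduces that class of error with a made-up word, a** (an assumption; the question does not say which word triggered it), and shows re.escape as one way to treat words literally. It also shows that a plain substring pattern counts the "hi" inside "which", whereas the expected table counts whole words only; \b word boundaries are one possible way to handle that.

```python
import re
import pandas as pd

sentences = pd.Series([
    "what does a** mean",                 # made-up sentence for illustration
    "hi or hello, which is more formal",  # taken from the question's df_2
])

bad_word = "a**"  # hypothetical word containing regex metacharacters

# Compiling the raw word as a pattern fails, just as it would inside str.count
try:
    re.compile(bad_word)
    error_message = None
except re.error as exc:
    error_message = str(exc)
print(error_message)

# re.escape turns the word into a literal pattern
literal_count = int(sentences.str.count(re.escape(bad_word)).sum())

# Substring match versus whole-word match for "hi"
substring_hi = int(sentences.str.count("hi").sum())
whole_word_hi = int(sentences.str.count(r"\bhi\b").sum())
print(literal_count, substring_hi, whole_word_hi)  # 1 2 1
```

Escaping removes the error but not the cost: the pattern is still applied to every sentence once per word.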

【Comments】:

Tags: python pandas string count

【Solution 1】:

First split each sentence into words and use DataFrame.explode to get one row per word, then remove trailing commas with Series.str.strip:

df2 = df_2.assign(word = df_2['sentence'].str.split()).explode('word')
df2['word'] = df2['word'].str.strip(',')
#print (df2)

Then use DataFrame.merge with a left join and aggregate with GroupBy.count, which excludes missing values, together with sum:

df3 = (df.merge(df2, on='word', how='left')
         .groupby('word')
         .agg(count=('metric_x', 'count'), metric_x_sum=('metric_x','sum')))
# print (df3)

Finally, join the result back onto the original:

df = df.join(df3, on='word')
df['metric_x_sum'] = df['metric_x_sum'].astype(int)
print (df)
      word  count  metric_x_sum
0       hi      4           110
1    hello      5           105
2      bye      0             0
3  goodbye      1             5
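
For convenience, here are the steps of this answer assembled into one runnable sketch. The two input frames are rebuilt from the sample data in the question; the names tokens, stats and result are illustrative and not from the original answer.

```python
import pandas as pd

# Sample data reconstructed from the question
df = pd.DataFrame({"word": ["hi", "hello", "bye", "goodbye"]})
df_2 = pd.DataFrame({
    "sentence": [
        "hello, what a wonderful day",
        "I did not said hello today",
        "what comes first, hi or hello",
        "the most used word is hi",
        "hi or hello, which is more formal",
        "he said goodbye, even though he never said hi or hello in the first place",
    ],
    "metric_x": [10, 15, 25, 30, 50, 5],
})

# One row per whitespace-separated token, with surrounding commas stripped
tokens = df_2.assign(word=df_2["sentence"].str.split()).explode("word")
tokens["word"] = tokens["word"].str.strip(",")

# Left join keeps words that never occur; count ignores their missing metric_x
stats = (
    df.merge(tokens, on="word", how="left")
      .groupby("word")
      .agg(count=("metric_x", "count"), metric_x_sum=("metric_x", "sum"))
)

# Attach the statistics to the original word list, preserving its order
result = df.join(stats, on="word")
result["metric_x_sum"] = result["metric_x_sum"].astype(int)
print(result)
```

Note that str.strip(',') only removes commas. Real sentences with other punctuation would presumably need a wider character set, and a word that appears twice in one sentence contributes that sentence's metric_x twice.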

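【Discussion】:

Hi, thanks. Could you elaborate on why your approach is faster than mine?
@JonasPalačionis - Hmm, with your approach I am not sure how to get the values of metric_x into metric_x_sum. It should also be faster, because merge is faster here.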
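
On the speed question, a rough back-of-the-envelope count may help. This is only a sketch using the sizes stated in the question, not a benchmark: the apply/str.count approach applies one regex pattern to every sentence for every word, whereas the split/explode/merge approach tokenizes each sentence once and then matches tokens against the word list in a single join.

```python
# Sizes stated in the question
n_words = 70_000
n_sentences = 250_000

# apply + str.count: every sentence is scanned once per word
regex_scans = n_words * n_sentences
print(f"{regex_scans:,}")  # 17,500,000,000
```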