Fastest way to count word occurrences in Pandas

【Question title】: Fastest way to count occurrence of words in Pandas
【Posted】: 2020-02-22 01:57:11
【Question description】:

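I have a list of strings. For each row of a Pandas column, I want to count the occurrences of all the words in that list and add a new column holding this count.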

import pandas as pd

words = ["I", "want", "please"]
data = pd.DataFrame({"col": ["I want to find", "the fastest way",
                             "to count occurrence", "of words in a column",
                             "Can you help please"]})
# Join the words into one regex alternation and count its matches in each row.
data["Count"] = data.col.str.count("|".join(words))
print(data)

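The code shown here does exactly what I want, but it takes a long time to run on long texts and long word lists. Can you suggest a faster way to do the same thing?

Thanks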

【Question discussion】:

Tags: python string count

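【Solution 1】:

Maybe you can use Counter. If you have several sets of words to test against the same text, just save the intermediate result after applying Counter and reuse it. Because the counted words then live in a dictionary keyed by word, testing whether that dictionary contains a given word is an O(1) operation on average.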

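from collections import Counter

data["Count"] = (
    data['col'].str.split()   # split each row into whitespace-separated tokens
    .apply(Counter)           # per-row mapping of token -> count (the step worth saving)
    # Count how many of the listed words are present in the row.
    .apply(lambda counts: sum(word in counts for word in words))
)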
>>> data
                    col  Count
0        I want to find      2
1       the fastest way      0
2   to count occurrence      0
3  of words in a column      0
4   Can you help please      1
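
The "save the intermediate step" idea can be sketched as follows. This is only an illustration under a few assumptions: the helper `count_matches`, the second word list `other_words`, and the extra row "please please help" are hypothetical and not part of the original answer. The sketch also sums `counts[word]` instead of testing `word in counts`, so a word that appears twice in a row is counted twice, which is closer to what `str.count` reports when the words only occur as whole tokens.

```python
from collections import Counter

import pandas as pd

words = ["I", "want", "please"]
other_words = ["to", "the"]  # hypothetical second word list

data = pd.DataFrame({"col": ["I want to find", "the fastest way",
                             "to count occurrence", "of words in a column",
                             "Can you help please",
                             "please please help"]})  # hypothetical extra row

# Tokenize and count once; this is the intermediate result to keep around.
row_counts = data["col"].str.split().apply(Counter)


def count_matches(counts, vocabulary):
    """Sum the occurrences of every vocabulary word in one row's Counter."""
    # A Counter returns 0 for missing keys, so no membership test is needed.
    return sum(counts[word] for word in vocabulary)


# Reuse the same Counters for each word list instead of re-splitting the text.
data["Count"] = row_counts.apply(lambda counts: count_matches(counts, words))
data["OtherCount"] = row_counts.apply(lambda counts: count_matches(counts, other_words))
print(data)
```

Unlike the regex version, this matches whole whitespace-separated tokens only, so a listed word that is embedded in a longer token or followed by punctuation is not counted.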

【Discussion】:

I tested your solution and the run time was divided by 4. Thank you.
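
To check the speed difference on your own data, a rough timing harness along these lines may help. It is a sketch only: the benchmark rows (the sample rows repeated 2000 times) and the function names `regex_count` and `counter_count` are assumptions made for illustration, and the ratio you measure will depend on text length, word-list size, and pandas version.

```python
import timeit
from collections import Counter

import pandas as pd

words = ["I", "want", "please"]
sample_rows = ["I want to find", "the fastest way", "to count occurrence",
               "of words in a column", "Can you help please"]
# Hypothetical benchmark data: the sample rows repeated many times.
data = pd.DataFrame({"col": sample_rows * 2000})


def regex_count():
    # Approach from the question: one regex alternation per row.
    return data["col"].str.count("|".join(words))


def counter_count():
    # Approach from Solution 1: split, build a Counter, test membership.
    return (
        data["col"].str.split()
        .apply(Counter)
        .apply(lambda counts: sum(word in counts for word in words))
    )


# Total time for 5 runs of each approach, in seconds.
regex_time = timeit.timeit(regex_count, number=5)
counter_time = timeit.timeit(counter_count, number=5)
print(f"regex:   {regex_time:.4f} s")
print(f"Counter: {counter_time:.4f} s")
```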