How to efficiently calculate the prefix sum of character frequencies in a string? (Answers)

【Question title】: How to efficiently calculate prefix sum of frequencies of characters in a string?
【Posted】: 2019-09-18 03:03:02
【Question description】:

Say I have a string

s = 'AAABBBCAB'

How can I efficiently compute the prefix sum of frequencies for each character in the string, i.e.:

psum = [{'A': 1}, {'A': 2}, {'A': 3}, {'A': 3, 'B': 1}, {'A': 3, 'B': 2}, {'A': 3, 'B': 3}, {'A': 3, 'B': 3, 'C': 1}, {'A': 4, 'B': 3, 'C': 1}, {'A': 4, 'B': 4, 'C': 1}]

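【Question discussion】:

In the end, do you want a single dictionary, or a list of dictionaries, one per character as it is read?
@Vanjith I want a running counter of character frequencies.

Tags: python python-3.x string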

【Solution 1】:

You can do it in one line with itertools.accumulate and collections.Counter:

from collections import Counter
from itertools import accumulate

s = 'AAABBBCAB'
psum = list(accumulate(map(Counter, s)))

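This gives you a list of Counter objects. Now, to get the frequencies for any substring of s, you can simply subtract two counters. The cost of the subtraction depends on the number of distinct characters rather than on the length of the substring, so it is O(1) for a fixed alphabet, e.g.: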

>>> psum[6] - psum[1]  # get frequencies for s[2:7]
Counter({'B': 3, 'A': 1, 'C': 1})
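The subtraction above needs a special case when the substring starts at index 0, because there is no earlier prefix to subtract. A minimal sketch of a wrapper follows; the name substring_counts and its slice-style (start, stop) arguments are assumptions made for illustration and are not part of the original answer.

```python
from collections import Counter
from itertools import accumulate

s = 'AAABBBCAB'
psum = list(accumulate(map(Counter, s)))  # psum[i] holds the counts for s[:i + 1]

def substring_counts(psum, start, stop):
    """Return the character counts for s[start:stop] (hypothetical helper)."""
    if stop <= start:
        return Counter()            # empty slice
    upper = psum[stop - 1]          # counts for s[:stop]
    if start == 0:
        return Counter(upper)       # copy, so callers cannot mutate psum
    return upper - psum[start - 1]  # remove the counts for s[:start]

# Same query as above: frequencies for s[2:7]
assert substring_counts(psum, 2, 7) == Counter('ABBBC')
# A slice that starts at the beginning of the string
assert substring_counts(psum, 0, 3) == Counter('AAA')
```

Note that Counter subtraction drops keys whose count falls to zero or below, which is the desired behaviour here since absent characters simply do not appear in the result.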

【Discussion】:

【Solution 2】:

Here is one option:

from collections import Counter

c = Counter()
s = 'AAABBBCAB'

psum = []
for char in s:
    c.update(char)
    psum.append(dict(c))

# [{'A': 1}, {'A': 2}, {'A': 3}, {'A': 3, 'B': 1}, {'A': 3, 'B': 2}, 
#  {'A': 3, 'B': 3}, {'A': 3, 'B': 3, 'C': 1}, {'A': 4, 'B': 3, 'C': 1},
#  {'A': 4, 'B': 4, 'C': 1}]

I use collections.Counter to keep a "running sum" and append a copy of the current result to the list psum. That way I iterate over the string s only once.
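The copy matters: appending the counter itself would store the same object at every position, so every entry would end up showing the final totals. A small sketch of the difference, where the list names aliased and copied are chosen only for this illustration:

```python
from collections import Counter

s = 'AAABBBCAB'

# Appending the counter itself: every list entry is the same object.
c = Counter()
aliased = []
for char in s:
    c.update(char)
    aliased.append(c)

# Appending a snapshot: each entry keeps the counts seen so far.
c = Counter()
copied = []
for char in s:
    c.update(char)
    copied.append(dict(c))

print(aliased[0] is aliased[-1])  # True: one shared object
print(copied[0])                  # {'A': 1}
print(copied[-1])                 # {'A': 4, 'B': 4, 'C': 1}
```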

If you would rather have collections.Counter objects in the result, you can change the last line to

psum.append(c.copy())

to get

[Counter({'A': 1}), Counter({'A': 2}), ...
 Counter({'A': 4, 'B': 4, 'C': 1})]

The same result can also be achieved like this (using accumulate was first proposed in Eugene Yarmash's answer; I just avoid map in favour of a generator expression):

from itertools import accumulate
from collections import Counter

s = "AAABBBCAB"
psum = list(accumulate(Counter(char) for char in s))

Just for completeness (since there is no "pure dict" answer here yet): if you do not want to use Counter or defaultdict, you could also use this:

c = {}
s = 'AAABBBCAB'

psum = []
for char in s:
    c[char] = c.get(char, 0) + 1
    psum.append(c.copy())

Note, though, that defaultdict is usually faster than dict.get(key, default).
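Whether that difference is noticeable depends on the Python version and the input, so it is worth measuring. The following is a rough timing sketch rather than a rigorous benchmark; the function names, the sample string, and the repetition count are all assumptions chosen for illustration.

```python
import timeit
from collections import defaultdict

def prefix_counts_get(s):
    """Running counts with a plain dict and dict.get."""
    c = {}
    out = []
    for ch in s:
        c[ch] = c.get(ch, 0) + 1
        out.append(c.copy())
    return out

def prefix_counts_defaultdict(s):
    """Running counts with defaultdict(int)."""
    c = defaultdict(int)
    out = []
    for ch in s:
        c[ch] += 1
        out.append(dict(c))
    return out

sample = 'AAABBBCAB' * 200  # arbitrary test input

# Both variants must agree before their timings are worth comparing.
assert prefix_counts_get(sample) == prefix_counts_defaultdict(sample)

for fn in (prefix_counts_get, prefix_counts_defaultdict):
    elapsed = timeit.timeit(lambda: fn(sample), number=20)
    print(f'{fn.__name__}: {elapsed:.4f} s')
```

For short strings over a small alphabet, the per-step copy into the list is likely to dominate, so the two variants may come out close.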

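【Discussion】:

We don't even need Counter here; a simple defaultdict will do. @hiro-protagonist, please check my answer below!
What makes you say defaultdict is "simpler" than Counter? Simpler in what way?
@DeveshKumarSingh They are both subclasses of dict; a Counter's data structure is no more complex than a dict's. Or am I missing something?
@DeveshKumarSingh, that consideration is wrong. I have pointed out the difference in time performance, but the OP should make his (or her) own decision.
@DeveshKumarSingh: your answer came later than this one, it has exactly the same structure with a slightly different type, and it has the same complexity, but its output is more verbose. You shouldn't advertise here.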

【Solution 3】:

Actually, you don't even need a Counter; a defaultdict is enough!

from collections import defaultdict

c = defaultdict(int)
s = 'AAABBBCAB'

psum = []

# Iterate through the characters
for char in s:
    # Update the count for the current character
    c[char] += 1
    # Append a snapshot of the running counts to the output list
    psum.append(dict(c))

print(psum)

The output looks like

[{'A': 1}, {'A': 2}, {'A': 3}, {'A': 3, 'B': 1}, 
{'A': 3, 'B': 2}, {'A': 3, 'B': 3}, 
{'A': 3, 'B': 3, 'C': 1}, {'A': 4, 'B': 3, 'C': 1}, 
{'A': 4, 'B': 4, 'C': 1}]

【Discussion】:

【Solution 4】:

The simplest way is to use the Counter object from collections.

from collections import Counter

s = 'AAABBBCAB'

[dict(Counter(s[:i])) for i in range(1, len(s) + 1)]

This yields:

[{'A': 1}, {'A': 2}, {'A': 3}, {'A': 3, 'B': 1}, {'A': 3, 'B': 2},
{'A': 3, 'B': 3}, {'A': 3, 'B': 3, 'C': 1}, {'A': 4, 'B': 3, 'C': 1},
{'A': 4, 'B': 4, 'C': 1}]

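【Discussion】:

Note that Counter is a subclass of dict, so there is no reason to replace the Counter with a plain dict.
I agree, but it matches the output the user specified more closely. I would keep the Counter objects myself, since they have useful features beyond being dictionaries.
This is an elegant one-liner, so +1, but it is quadratic rather than linear. I suspect hiro protagonist's similar solution is more efficient.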
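To make the quadratic-versus-linear remark concrete: recounting every prefix from scratch examines 1 + 2 + ... + n characters, whereas a running counter examines each character once. The sketch below checks that both approaches agree and tallies that work for the sample string; the variable names are illustrative only.

```python
from collections import Counter

s = 'AAABBBCAB'
n = len(s)

# Recount every prefix from scratch (the slicing approach).
by_slicing = [dict(Counter(s[:i])) for i in range(1, n + 1)]

# Keep one running counter and snapshot it after each character.
running = Counter()
by_running = []
for ch in s:
    running[ch] += 1
    by_running.append(dict(running))

assert by_slicing == by_running

# Characters examined while counting: n * (n + 1) / 2 versus n.
chars_counted_by_slicing = sum(range(1, n + 1))
print(chars_counted_by_slicing, n)  # 45 9
```

The running version still copies the dictionary at every step, so its total cost also grows with the number of distinct characters, but it no longer rescans the string.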

【Solution 5】:

In Python 3.8 and later, you can use a list comprehension with an assignment expression (a.k.a. the "walrus operator"):

>>> from collections import Counter
>>> s = 'AAABBBCAB'
>>> c = Counter()
>>> [c := c + Counter(x) for x in s]
[Counter({'A': 1}), Counter({'A': 2}), Counter({'A': 3}), Counter({'A': 3, 'B': 1}), Counter({'A': 3, 'B': 2}), Counter({'A': 3, 'B': 3}), Counter({'A': 3, 'B': 3, 'C': 1}), Counter({'A': 4, 'B': 3, 'C': 1}), Counter({'A': 4, 'B': 4, 'C': 1})]

【Discussion】: