Why does Python lru_cache perform best when maxsize is a power of two? Answers

【Question title】: Why does python lru_cache perform best when maxsize is a power-of-two?
【Posted】: 2020-05-01 04:10:26
【Question description】:

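The documentation says:

If maxsize is set to None, the LRU feature is disabled and the cache can grow without bound. The LRU feature performs best when maxsize is a power-of-two.

Does anyone know where this "power-of-two" comes from? I suppose it has something to do with the implementation.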

【Question discussion】:

Tags: python-3.x caching lru

【Solution 1】:

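Where the size effect arises

The lru_cache() code exercises its underlying dictionary in an atypical way. While keeping the total size constant, each cache miss deletes the oldest item and inserts a new one.

The power-of-two suggestion is an artifact of how this delete-and-insert pattern interacts with the underlying dictionary implementation.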

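How dictionaries work

Table sizes are a power of two.
Deleted keys are replaced with dummy entries.
New keys can sometimes reuse a dummy slot, sometimes not.
Repeated delete-and-inserts with different keys fill up the table with dummy entries.
An O(N) resize operation runs when the table is two-thirds full.
Since the number of active entries remains constant, a resize operation doesn't actually change the table size.
The only effect of the resize is to clear the accumulated dummy entries.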
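The steps listed above can be sketched with a toy open-addressing table. This is only an illustration under simplifying assumptions, not CPython's dict: the class `ToyTable` and the helper `churn` are hypothetical names, the table uses linear probing, it never reuses a dummy slot, and its size is fixed, so a "resize" only rebuilds the table to drop dummies.

```python
EMPTY = object()   # slot that has never been used
DUMMY = object()   # tombstone left behind by a deletion


class ToyTable:
    """Fixed-size open-addressing table (assumption: linear probing)."""

    def __init__(self, size):
        self.size = size              # a power of two, like a real dict table
        self.slots = [EMPTY] * size
        self.filled = 0               # live keys plus dummies
        self.rebuilds = 0

    def _find(self, key):
        """Index of key, or of the first EMPTY slot on its probe path."""
        i = hash(key) % self.size
        while self.slots[i] is not EMPTY and self.slots[i] != key:
            i = (i + 1) % self.size
        return i

    def insert(self, key):
        i = self._find(key)
        if self.slots[i] is EMPTY:
            self.slots[i] = key
            self.filled += 1
            if self.filled * 3 >= self.size * 2:   # two-thirds full
                self._rebuild()

    def delete(self, key):
        i = self._find(key)
        if self.slots[i] is not EMPTY:
            self.slots[i] = DUMMY     # keeps probe chains intact

    def _rebuild(self):
        """O(n) pass that keeps the table size and clears the dummies."""
        live = [k for k in self.slots if k is not EMPTY and k is not DUMMY]
        self.slots = [EMPTY] * self.size
        self.filled = 0
        self.rebuilds += 1
        for k in live:
            self.slots[self._find(k)] = k
            self.filled += 1


def churn(maxsize, table_size=128, events=5_000):
    """Evict the oldest key and insert a new one, counting rebuilds."""
    table = ToyTable(table_size)
    for key in range(maxsize):
        table.insert(key)
    for key in range(maxsize, maxsize + events):
        table.delete(key - maxsize)
        table.insert(key)
    return table.rebuilds


for maxsize in (42, 64, 84):
    print(maxsize, churn(maxsize))
```

In this sketch, the closer `maxsize` sits to two-thirds of the table, the fewer dummies fit before the next rebuild, so the rebuild count climbs.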

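Performance implications

A dict with 2**n entries has the most available space for dummy entries, so the O(n) resizes occur less often.

Also, sparse dictionaries have fewer hash collisions than mostly full dictionaries. Collisions degrade dictionary performance.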

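When it matters

The lru_cache() only updates the dictionary when there is a cache miss. Also, when there is a miss, the wrapped function is called. So the effect of resizes only matters if there is a high proportion of misses and the wrapped function is very cheap.

Far more important than giving maxsize a power of two is using the largest reasonable maxsize. Bigger caches have more cache hits, and that is where the big wins come from.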
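The hit-rate point can be checked with `cache_info()`. The following is a minimal sketch, assuming a uniformly random workload over 500 distinct keys; `hit_ratio`, `cheap`, and the workload parameters are made up for illustration.

```python
from functools import lru_cache
import random


def hit_ratio(maxsize, calls=20_000, distinct=500, seed=0):
    """Fraction of calls served from the cache for a random workload."""
    rng = random.Random(seed)          # fixed seed so runs are repeatable

    @lru_cache(maxsize=maxsize)
    def cheap(x):
        return x * 2

    for _ in range(calls):
        cheap(rng.randrange(distinct))

    info = cheap.cache_info()
    return info.hits / (info.hits + info.misses)


for maxsize in (64, 256, 512):
    print(maxsize, round(hit_ratio(maxsize), 3))
```

Under these assumptions the hit ratio grows with `maxsize`, and once the cache can hold every distinct key almost every call is a hit.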

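Simulation

Once an lru_cache() is full and the first resize has occurred, the dictionary settles into a steady state and will never get larger. Here, we simulate what happens next as new dummy entries are added and periodic resizes clear them away.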

steady_state_dict_size = 2 ** 7     # always a power of two

def simulate_lru_cache(lru_maxsize, events=1_000_000):
    'Count resize operations as dummy keys are added'
    resize_point = steady_state_dict_size * 2 // 3
    assert lru_maxsize < resize_point
    dummies = 0
    resizes = 0
    for i in range(events):
        dummies += 1
        filled = lru_maxsize + dummies
        if filled >= resize_point:
            dummies = 0
            resizes += 1
    work = resizes * lru_maxsize    # resizing is O(n)
    work_per_event = work / events
    print(lru_maxsize, '-->', resizes, work_per_event)

Here is an excerpt of the output:

for maxsize in range(42, 85):
    simulate_lru_cache(maxsize)

42 --> 23255 0.97671
43 --> 23809 1.023787
44 --> 24390 1.07316
45 --> 25000 1.125
46 --> 25641 1.179486
  ...
80 --> 200000 16.0
81 --> 250000 20.25
82 --> 333333 27.333306
83 --> 500000 41.5
84 --> 1000000 84.0

What this shows is that the cache does significantly less work when maxsize is as far away as possible from the resize_point.
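The numbers in the excerpt also follow from a short closed form: in the simulation a resize fires every `resize_point - lru_maxsize` events. The function below is a sketch of that arithmetic; `predicted_work` is a hypothetical name, and the defaults mirror the simulation's values.

```python
def predicted_work(lru_maxsize, events=1_000_000, steady_state_dict_size=2 ** 7):
    """Closed-form version of the simulation loop."""
    resize_point = steady_state_dict_size * 2 // 3
    gap = resize_point - lru_maxsize       # dummies that fit before a resize
    resizes = events // gap
    work_per_event = resizes * lru_maxsize / events   # resizing is O(n)
    return resizes, work_per_event


print(predicted_work(42))   # (23255, 0.97671)
print(predicted_work(84))   # (1000000, 84.0)
```

As the gap shrinks toward 1, the work per event approaches `lru_maxsize`, matching the last rows of the excerpt.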

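History

The effect was minimal in Python3.2, when dictionaries grew by 4 x active_entries when resizing.

The effect became catastrophic when the growth rate was lowered to 2 x active entries.

Later a compromise was reached, setting the growth rate to 3 x used. That significantly mitigated the issue by giving us a larger steady-state size by default.

A power-of-two maxsize is still the optimum setting, giving the least work for a given steady-state dictionary size, but it no longer matters as much as it did in Python3.2.

Hope this helps clear up your understanding. :-)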
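To see why a growth rule can favor powers of two, here is a rough model of one rule only: a target of 2 x active entries plus one, rounded up to a power of two. This is an assumption for illustration, not CPython's actual resize code, and `table_size_for` and `dummy_room` are hypothetical names.

```python
def table_size_for(active, minimum=8):
    """Smallest power of two that is >= 2 * active + 1 (assumed rule)."""
    target = 2 * active + 1
    size = minimum
    while size < target:
        size *= 2
    return size


def dummy_room(maxsize):
    """Table size and the number of dummies that fit before a resize."""
    size = table_size_for(maxsize)
    usable = size * 2 // 3             # resize at two-thirds full
    return size, usable - maxsize


for maxsize in (63, 64, 65):
    print(maxsize, dummy_room(maxsize))
# 63 (128, 22)
# 64 (256, 106)
# 65 (256, 105)
```

Under this assumed rule, 64 is the smallest maxsize that lands in the 256-slot table, so it gets the most room for dummy entries in that table.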

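【Discussion】:

I'd just like to point out that the author of this answer is the person who added the original "power-of-two" comment quoted in the question :)
Thanks Raymond for the explanation, it is really cool to have the author of this implementation comment here. :) I am trying to understand two sentences. First: "A dict with 2**n entries has the most available space for dummy entries, so the O(n) resizes occur less often". It is not clear to me why it has the most available space for dummy entries. Second: "Once an lru_cache() is full and the first resize has occurred, the dictionary settles into a steady state and will never get larger". Can't the cache be full without a resize?
The work per event is lowest when maxsize is 42, yet 42 is not a power of two. This confuses me. I understand that 42 is about 2/3 of a steady-state dict size of 64. What am I missing here?
@MrMatrix The missing piece is the growth_factor. When it is 2 x active_entries, plus one, and rounded up to the next power of two, the jump to the next larger table size happens exactly at a power of two. The larger table gives you the most room for adding dummy entries before the next resize.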

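【Solution 2】:

TL;DR - this is an optimization that doesn't have much effect at small lru_cache sizes, but (see Raymond's reply) it has a larger effect as the lru_cache size gets bigger.

So this piqued my interest and I decided to see whether it was actually true.

First I read the source for the LRU cache. The cpython implementation is here: https://github.com/python/cpython/blob/master/Lib/functools.py#L723 I didn't see anything that jumped out at me as something that would operate better based on powers of two.

So, I wrote a short python program to make LRU caches of various sizes and then exercise those caches several times. Here's the code: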

from functools import lru_cache
from collections import defaultdict
from statistics import mean
import time

def run_test(i):
    # We create a new decorated perform_calc
    @lru_cache(maxsize=i)
    def perform_calc(input):
        return input * 3.1415

    # let's run the test 5 times (so that we exercise the caching)
    for j in range(5):
        # Calculate the value for a range larger than our largest cache
        for k in range(2000):
            perform_calc(k)

# Collect timings across all 10 rounds so that mean() averages every run
values = defaultdict(list)
for t in range(10):
    print(t)
    for i in range(1, 1025):
        start = time.perf_counter()
        run_test(i)
        elapsed = time.perf_counter() - start
        values[i].append(elapsed)

for k, v in values.items():
    print(f"{k}\t{mean(v)}")

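I ran this on a macbook pro under light load with python 3.7.7.

Here are the results:

https://docs.google.com/spreadsheets/d/1LqZHbpEL_l704w-PjZvjJ7nzDI1lx8k39GRdm3YGS6c/preview?usp=sharing

The random spikes are probably due to GC pauses or system interrupts.

At this point I realized that my code always generates cache misses and never cache hits. What happens if we run the same thing, but always hit the cache?

I replaced the inner loop with: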

    # let's run the test 5 times (so that we exercise the caching)
    for j in range(5):
        # Only ever create cache hits
        for k in range(i):
            perform_calc(k)
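Both access patterns can be checked with `cache_info()`. A small sketch, using a hypothetical helper `cache_stats` and an arbitrary cache size of 64: scanning `range(2000)` in order never hits, while scanning `range(maxsize)` hits on every pass after the first.

```python
from functools import lru_cache


def cache_stats(maxsize, keys, passes=5):
    """Run the benchmark's access pattern and return (hits, misses)."""

    @lru_cache(maxsize=maxsize)
    def perform_calc(value):
        return value * 3.1415

    for _ in range(passes):
        for k in keys:
            perform_calc(k)

    info = perform_calc.cache_info()
    return info.hits, info.misses


print(cache_stats(64, range(2000)))   # (0, 10000): every call misses
print(cache_stats(64, range(64)))     # (256, 64): misses only on the first pass
```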

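The data for this all-hits run is in the second tab of the same spreadsheet as above.

Let's take a look:

Hmm, but we don't really care about most of these numbers. Also, we're not doing the same amount of work for each test, so the timing doesn't seem useful.

What if we ran it for just 2^n, 2^n + 1, and 2^n - 1? Since this speeds things up, we'll average over 100 tests instead of just 10.

We'll also generate a large random list to run on, since that way we'd expect to have some cache hits and some cache misses.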

from functools import lru_cache
from collections import defaultdict
from statistics import mean
import time
import random

# Eight copies of 0..127, shuffled below
rands = list(range(128)) * 8
random.shuffle(rands)

def run_test(i):
    # We create a new decorated perform_calc
    @lru_cache(maxsize=i)
    def perform_calc(input):
        return input * 3.1415

    # let's run the test 5 times (so that we exercise the caching)
    for j in range(5):
        for k in rands:
            perform_calc(k)

# Collect timings across all 100 rounds so that mean() averages every run
values = defaultdict(list)
for t in range(100):
    print(t)
    # Cache sizes of interest: each power of two and its two neighbors
    for i in [15, 16, 17, 31, 32, 33, 63, 64, 65, 127, 128, 129, 255, 256, 257, 511, 512, 513, 1023, 1024, 1025]:
        start = time.perf_counter()
        run_test(i)
        elapsed = time.perf_counter() - start
        values[i].append(elapsed)

for k, v in values.items():
    print(f"{k}\t{mean(v)}")

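The data for this is in the third tab of the spreadsheet above.

Here's a graph of the average time per element versus lru cache size:

Of course, the time decreases as the cache size gets larger, since we don't spend as much time performing calculations. The interesting thing is that there does seem to be a dip from 15 to 16, 17 and from 31 to 32, 33. Let's zoom in on the higher numbers:

Not only do we lose that pattern at the higher numbers, but we actually see performance decrease for some of the powers of two (511 to 512, 513).

Edit: The note about power-of-two was added in 2012, but the algorithm for functools.lru_cache looks the same at that commit, so unfortunately that disproves my theory that the algorithm has changed and the docs are out of date.

Edit: Removed my hypothesis. The original author answered above. The problem with my code is that I was working with "small" caches, which means the O(n) resize on the dicts was not very expensive. It would be cool to experiment with very large lru_caches and lots of cache misses to see whether we can get the effect to appear.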
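One way such an experiment might be set up is sketched below. It is untested as a benchmark design and makes no claim about what the timings will show: `time_all_misses`, the cache sizes around 2 ** 16, and the call count are all arbitrary choices for illustration.

```python
from functools import lru_cache
import time


def time_all_misses(maxsize, calls=200_000):
    """Time a stream of never-seen keys against a full cache."""

    @lru_cache(maxsize=maxsize)
    def cheap(x):
        return x                       # trivially cheap wrapped function

    for k in range(maxsize):           # fill the cache before timing
        cheap(k)

    start = time.perf_counter()
    for k in range(maxsize, maxsize + calls):
        cheap(k)                       # every call misses and evicts
    return time.perf_counter() - start


for size in (2 ** 16 - 1, 2 ** 16, 2 ** 16 + 1):
    print(size, time_all_misses(size))
```

Repeating each measurement several times and comparing the minimums would reduce noise from GC pauses and other system activity.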

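【Discussion】:

A power-of-two maxsize gives the best cache-miss performance for a given dictionary size. If the size is allowed to grow larger, there is no speed penalty. Instead, the cost comes in memory consumption. Presumably, the only reason a person sets maxsize is that they want to limit memory use.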