Summing with a for loop faster than with reduce? Answers

【Question title】: Summing with a for loop faster than with reduce?
【Posted】: 2018-05-23 03:09:54
【Question description】:

I wanted to see how much faster reduce is than a for loop for simple numerical operations. Here is what I found (using the standard timeit library):

In [54]: print(setup)
from operator import add, iadd
r = range(100)

In [55]: print(stmt1)    
c = 0
for i in r:
    c+=i        

In [56]: timeit(stmt1, setup)
Out[56]: 8.948904991149902

In [58]: print(stmt3)
reduce(add, r)    

In [59]: timeit(stmt3, setup)
Out[59]: 13.316915035247803

Digging a little further:

In [68]: timeit("1+2", setup)
Out[68]: 0.04145693778991699

In [69]: timeit("add(1,2)", setup)
Out[69]: 0.22807812690734863

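What is going on here? Obviously, reduce does loop faster than for, but the function call seems to dominate. Shouldn't the reduce version run almost entirely in C? Using iadd(c,i) in the for loop version makes it run in about 24 seconds. Why would operator.add be so much slower than +? I was under the impression that + and operator.add run the same C code (I checked to make sure operator.add wasn't just calling + in Python or anything like that).

By the way, just using sum runs in about 2.3 seconds.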

In [70]: print(sys.version)
2.7.1 (r271:86882M, Nov 30 2010, 09:39:13) 
[GCC 4.0.1 (Apple Inc. build 5494)]
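
For readers on Python 3, the measurements above can be reproduced with a sketch along the following lines. It is not the asker's exact setup: reduce has to be imported from functools, the helper names loop_sum and reduce_sum are made up for illustration, and wrapping each statement in a function adds one call of overhead per run. Absolute timings will differ from the Python 2.7.1 numbers quoted above.

```python
from functools import reduce  # reduce is no longer a builtin in Python 3
from operator import add
from timeit import timeit

r = range(100)

def loop_sum(values):
    """Sum with an explicit for loop, as in stmt1 (hypothetical helper)."""
    total = 0
    for v in values:
        total += v
    return total

def reduce_sum(values):
    """Sum with reduce(add, ...), as in stmt3 (hypothetical helper)."""
    return reduce(add, values)

# All approaches must agree before their timings are worth comparing.
assert loop_sum(r) == reduce_sum(r) == sum(r)

if __name__ == "__main__":
    n = 20_000  # far fewer repetitions than timeit's default of 1,000,000
    for name, fn in [("for loop", loop_sum), ("reduce", reduce_sum), ("sum", sum)]:
        print(f"{name:8s} {timeit(lambda: fn(r), number=n):.3f} s")
```

The relative ordering, rather than the absolute numbers, is the thing to look at.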

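【Question discussion】:

The fact that using sum is 4x faster pretty much says "there should be one obvious way to do it".
@jsbbueno: True, but I was doing this to find the fastest way to do general numerical computations over a sequence. In the real world I would definitely use sum to sum things :D I haven't tried mul, but I expect the results would be similar.

Tags: python performance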

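【Solution 1】:

reduce(add, r) has to call the add() function about 100 times (99, to be exact, since the first element seeds the result), so the function-call overhead adds up. reduce uses PyEval_CallObject to call add on each iteration: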

for (;;) {
    ...
    if (result == NULL)
        result = op2;
    else {
        /* Here it fills a tuple to pass the previous result and the next
           value from range(100) into func add(): */
        PyTuple_SetItem(args, 0, result);
        PyTuple_SetItem(args, 1, op2);
        if ((result = PyEval_CallObject(func, args)) == NULL)
            goto Fail;
    }
}
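
The control flow of that C loop can be mirrored in a few lines of Python, which makes the number of calls easy to count. This is only an illustrative sketch, not CPython's implementation: py_reduce, counting_add and the counter dictionary are hypothetical names introduced here.

```python
from operator import add

_MISSING = object()  # sentinel meaning "no initial value supplied"

def py_reduce(func, iterable, initial=_MISSING):
    """Rough Python equivalent of the C loop: one func call per remaining element."""
    it = iter(iterable)
    if initial is _MISSING:
        try:
            result = next(it)  # the first element seeds the result, no call yet
        except StopIteration:
            raise TypeError("reduce() of empty sequence with no initial value") from None
    else:
        result = initial
    for item in it:
        result = func(result, item)  # the per-element call that dominates the cost
    return result

counter = {"calls": 0}

def counting_add(a, b):
    """Wrap operator.add so that every invocation is counted."""
    counter["calls"] += 1
    return add(a, b)

total = py_reduce(counting_add, range(100))
print(total, counter["calls"])  # 4950 99
```

Every one of those calls pays for argument packing and dispatch, which the inline c += i in the for loop avoids.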

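Update: in response to the question in the comments.

When you type 1 + 2 in Python source code, the bytecode compiler performs the addition itself at compile time and replaces the expression with the constant 3: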

f1 = lambda: 1 + 2
c1 = byteplay.Code.from_code(f1.func_code)
print c1.code

1           1 LOAD_CONST           3
            2 RETURN_VALUE

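If you add two variables, a + b, the compiler generates bytecode that loads the two variables and executes BINARY_ADD, which is much faster than calling a function to perform the addition: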

f2 = lambda a, b: a + b
c2 = byteplay.Code.from_code(f2.func_code)
print c2.code

1           1 LOAD_FAST            a
            2 LOAD_FAST            b
            3 BINARY_ADD           
            4 RETURN_VALUE
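
The same inspection can be done with the standard-library dis module instead of byteplay. The sketch below assumes CPython 3; the helper names opnames and has_binary_op are made up for illustration, and the exact opcode names vary between versions (the addition shows up as BINARY_ADD on older releases and as BINARY_OP on 3.11 and later).

```python
import dis

f1 = lambda: 1 + 2        # two literals: eligible for constant folding
f2 = lambda a, b: a + b   # two variables: must be added at run time

def opnames(func):
    """Return the opcode names that make up func's bytecode."""
    return [ins.opname for ins in dis.get_instructions(func)]

def has_binary_op(func):
    """True if the bytecode still contains an addition-style opcode."""
    return any(name.startswith("BINARY_") for name in opnames(func))

dis.dis(f1)  # no add instruction: the compiler already produced the constant 3
dis.dis(f2)  # loads a and b, then performs the add
print(has_binary_op(f1), has_binary_op(f2))
```

On CPython the first lambda should report no binary opcode at all, which is why timing "1+2" measures almost nothing.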

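【Discussion】:

Thanks for pointing that out! However, it doesn't explain why the original '1+2' is 5x faster than 'add(1,2)'. In fact, reduce is much faster than the for loop when the for loop uses iadd.
Why does your example use a third-party package rather than the built-in dis module?
No particular reason, I just happened to be using it at the moment.
If you don't have byteplay, you can use dis.dis(f1) instead of byteplay.Code.from_code(f1.func_code).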

【Solution 2】:

It may be the overhead of copying the args and the return value (i.e. "add(1, 2)"), as opposed to simply operating on numeric literals.
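
One way to separate call overhead from the effect of the literals is to time the addition with variables as well, so that the compiler cannot fold it away. The following is a minimal sketch assuming Python 3; the names a, b and timings are arbitrary, and the numbers will vary by machine and interpreter version.

```python
from timeit import timeit

setup = "from operator import add; a, b = 1, 2"
n = 200_000  # fewer repetitions than timeit's default, to keep the run short

timings = {
    "1 + 2": timeit("1 + 2", setup, number=n),          # folded to a constant at compile time
    "a + b": timeit("a + b", setup, number=n),          # a real addition at run time
    "add(a, b)": timeit("add(a, b)", setup, number=n),  # the same addition plus a function call
}
for stmt, seconds in timings.items():
    print(f"{stmt:10s} {seconds:.4f} s")
```

Comparing "a + b" with "add(a, b)" is the fairer test, since both perform the addition at run time and differ only in the call.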

【Discussion】:

【Solution 3】:

Edit: switching to zeros instead of array multiplication closes the gap considerably.

from functools import reduce
from numpy import array, arange, zeros
from time import time

def add(x, y):
    return x + y

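def sum_columns(x):
    # NOTE: total is only assigned when x has a nonzero element; an all-zero
    # or empty input would raise UnboundLocalError further down.
    if x.any():
        width = len(x[0])
        total = zeros(width)  # float accumulator, one slot per column
    for row in x:
        total += array(row)  # array(row) copies the row before adding it
    return total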

l = arange(3000000)
l = array([l, l, l])

start = time()
print(reduce(add, l))
print('Reduce took {}'.format(time() - start))

start = time()
print(sum_columns(l))
print('For loop took took {}'.format(time() - start))

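Sorry to disappoint you, but there is very little difference:

Reduce took 0.03230619430541992
For loop took took 0.058577775955200195

Old: reduce can be faster than a for loop when it is used to add NumPy arrays together index by index.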

from functools import reduce
from numpy import array, arange
from time import time

def add(x, y):
    return x + y

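def sum_columns(x):
    # NOTE: total is only assigned when x has a nonzero element; an all-zero
    # or empty input would raise UnboundLocalError further down.
    if x.any():
        width = len(x[0])
        total = array([0] * width)  # builds a Python list of width zeros first (slow)
    for row in x:
        total += array(row)  # array(row) copies the row before adding it
    return total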

l = arange(3000000)
l = array([l, l, l])

start = time()
print(reduce(add, l))
print('Reduce took {}'.format(time() - start))

start = time()
print(sum_columns(l))
print('For loop took took {}'.format(time() - start))

Result:

[      0       3       6 ..., 8999991 8999994 8999997]
Reduce took 0.024930953979492188
[      0       3       6 ..., 8999991 8999994 8999997]
For loop took took 0.3731539249420166

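【Discussion】:

The for loop in this example is so slow for a couple of reasons. 1) Use zeros to create the zero array rather than array([0] * width). 2) Having very few elements in the l array favors the reduce function, because the for loop has a high overhead. Once you have 6 or more elements, the for loop is faster.
@zeroth It does close the gap considerably. With Python 3.6 and the latest version of numpy, the two are almost identical in performance.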
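
The claim about the number of rows can be checked by timing both approaches for several row counts. The sketch below is not the answerer's code: sum_rows_loop, sum_rows_reduce and best_of are hypothetical helpers, the array sizes are arbitrary, and where the crossover falls (if there is one) depends on the machine and the NumPy version.

```python
from functools import reduce
from operator import add
from time import perf_counter

import numpy as np

def sum_rows_loop(x):
    """Add the rows of a 2-D array with an explicit loop and in-place +=."""
    total = np.zeros(x.shape[1], dtype=x.dtype)
    for row in x:
        total += row  # reuses the same accumulator, no new array per step
    return total

def sum_rows_reduce(x):
    """Add the rows of a 2-D array with reduce; each step allocates a new array."""
    return reduce(add, x)

def best_of(fn, x, repeats=5):
    """Smallest wall-clock time over several runs, to damp timing noise."""
    times = []
    for _ in range(repeats):
        start = perf_counter()
        fn(x)
        times.append(perf_counter() - start)
    return min(times)

for n_rows in (3, 6, 12):
    x = np.tile(np.arange(100_000), (n_rows, 1))
    # Both approaches must match NumPy's own column sum before timing them.
    assert np.array_equal(sum_rows_loop(x), x.sum(axis=0))
    assert np.array_equal(sum_rows_reduce(x), x.sum(axis=0))
    print(n_rows, best_of(sum_rows_reduce, x), best_of(sum_rows_loop, x))
```

In practice x.sum(axis=0) does the whole reduction inside NumPy and is usually the simplest choice.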