各种 numpy 花式索引方法的性能，也与 numba答案

【问题标题】：Performance of various numpy fancy indexing methods, also with numba各种 numpy 花式索引方法的性能，也与 numba
【发布时间】：2018-02-12 23:20:44
【问题描述】：

因为对于我的程序，Numpy 数组的快速索引是非常必要的，而且考虑到性能，花哨的索引并没有很好的声誉，所以我决定进行一些测试。尤其是Numba 发展很快，我尝试了哪些方法与 numba 配合得很好。

作为输入，我一直在为我的小数组测试使用以下数组：

import numpy as np
import numba as nb

x = np.arange(0, 100, dtype=np.float64)  # array to be indexed
idx = np.array((0, 4, 55, -1), dtype=np.int32)  # fancy indexing array
bool_mask = np.zeros(x.shape, dtype=np.bool)  # boolean indexing mask
bool_mask[idx] = True  # set same elements as in idx True
y = np.zeros(idx.shape, dtype=np.float64)  # output array
y_bool = np.zeros(bool_mask[bool_mask == True].shape, dtype=np.float64)  #bool output array (only for convenience)

以及用于我的大型数组测试的以下数组（此处需要y_bool 来处理来自randint 的欺骗号码）：

x = np.arange(0, 1000000, dtype=np.float64)
idx = np.random.randint(0, 1000000, size=int(1000000/50))
bool_mask = np.zeros(x.shape, dtype=np.bool)
bool_mask[idx] = True
y = np.zeros(idx.shape, dtype=np.float64)
y_bool = np.zeros(bool_mask[bool_mask == True].shape, dtype=np.float64)

这会在不使用 numba 的情况下产生以下时间：

%timeit x[idx]
#1.08 µs ± 21 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
#large arrays: 129 µs ± 3.45 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit x[bool_mask]
#482 ns ± 18.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
#large arrays: 621 µs ± 15.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit np.take(x, idx)
#2.27 µs ± 104 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# large arrays: 112 µs ± 5.76 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit np.take(x, idx, out=y)
#2.65 µs ± 134 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# large arrays: 134 µs ± 4.47 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit x.take(idx)
#919 ns ± 21.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
# large arrays: 108 µs ± 1.71 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit x.take(idx, out=y)
#1.79 µs ± 40.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
# larg arrays: 131 µs ± 2.92 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit np.compress(bool_mask, x)
#1.93 µs ± 95.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
# large arrays: 618 µs ± 15.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit np.compress(bool_mask, x, out=y_bool)
#2.58 µs ± 167 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# large arrays: 637 µs ± 9.88 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit x.compress(bool_mask)
#900 ns ± 82.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
# large arrays: 628 µs ± 17.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit x.compress(bool_mask, out=y_bool)
#1.78 µs ± 59.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
# large arrays: 628 µs ± 13.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit np.extract(bool_mask, x)
#5.29 µs ± 194 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# large arrays: 641 µs ± 13 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

还有numba，在nopython-mode、caching和nogil中使用jitting，我修饰了索引的方式，numba支持：

@nb.jit(nopython=True, cache=True, nogil=True)
def fancy(x, idx):
    x[idx]

@nb.jit(nopython=True, cache=True, nogil=True)
def fancy_bool(x, bool_mask):
    x[bool_mask]

@nb.jit(nopython=True, cache=True, nogil=True)
def taker(x, idx):
    np.take(x, idx)

@nb.jit(nopython=True, cache=True, nogil=True)
def ndtaker(x, idx):
    x.take(idx)

对于小型和大型数组，这会产生以下结果：

%timeit fancy(x, idx)
#686 ns ± 25.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
# large arrays: 84.7 µs ± 1.82 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit fancy_bool(x, bool_mask)
#845 ns ± 31 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
# large arrays: 843 µs ± 14.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit taker(x, idx)
#814 ns ± 21.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
# large arrays: 87 µs ± 1.52 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit ndtaker(x, idx)
#831 ns ± 24.5 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
# large arrays: 85.4 µs ± 2.69 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

总结

虽然对于没有 numba 的 numpy，但显然小数组最好用布尔掩码索引（与 ndarray.take(idx) 相比大约是 2 倍），而对于较大的数组 ndarray.take(idx) 将表现最好，在这种情况下大约是 6 次比布尔索引更快。盈亏平衡点的数组大小约为 1000 单元格，索引数组大小约为 20 单元格。
对于具有1e5 元素和5e3 索引数组大小的数组，ndarray.take(idx) 将比布尔掩码索引快10 倍。因此，布尔索引似乎随着数组大小而显着减慢，但在达到某个数组大小阈值后会稍微赶上。

对于 numba jitted 函数，除了布尔掩码索引之外，所有索引函数都有一个小的加速。简单的花式索引在这里效果最好，但仍然比没有抖动的布尔掩码慢。
对于较大的数组，布尔掩码索引比其他方法慢很多，甚至比非 jitted 版本慢。其他三种方法的性能都非常好，比非抖动版本快 15% 左右。

对于我有许多不同大小的数组的情况，使用 numba 进行花式索引是最好的方法。也许其他人也可以在这篇相当长的帖子中找到一些有用的信息。

编辑：
对不起，我忘了问我的问题，事实上我有。我只是在工作日结束时快速输入这个并完全忘记了它...... 那么，您知道比我测试的方法更好更快的方法吗？使用 Cython 我的时间介于 Numba 和 Python 之间。
由于索引数组是预先定义的，并且在长时间迭代中使用而无需更改，因此任何预定义索引过程的方式都会很棒。为此，我考虑过使用 strides。但我无法预先定义一组自定义的步幅。是否可以使用 strides 将预定义的视图放入内存中？

编辑 2：
我想我会把关于预定义常量索引数组的问题转移到一个新的更具体的问题上，这些数组将在相同的值数组（只有值改变但形状不改变）上进行几百万次迭代。这个问题太笼统了，也许我也提出了这个问题有点误导。我会在打开新问题后立即在此处发布链接！
Here is the link to the followup question.

【问题讨论】：

这里有什么问题？问一个真正的问题并自己回答不是更好吗？
Scotty，将您的问题更改为实际问题，然后将所有内容粘贴到自我回答中。如果您愿意，我将通过社区 wiki 将其粘贴，因此您可以在关闭（并删除）之前接受“不清楚您在问什么”
@DanielF 感谢您的提示！我在最后添加了一个问题！

标签： python performance numpy indexing numba

【解决方案1】：

您的总结并不完全正确，您已经使用不同大小的数组进行了测试，但您没有做的一件事是更改索引元素的数量。

我将其限制为纯索引并省略了take（实际上是整数数组索引）和compress 和extract（因为它们实际上是布尔数组索引）。这些的唯一区别是常数因子。方法 take 和 compress 的常数因子将小于 numpy 函数 np.take 和 np.compress 的开销，但对于合理大小的数组，影响可以忽略不计。

让我用不同的数字来表示它：

# ~ every 500th element
x = np.arange(0, 1000000, dtype=np.float64)
idx = np.random.randint(0, 1000000, size=int(1000000/500))  # changed the ratio!
bool_mask = np.zeros(x.shape, dtype=np.bool)
bool_mask[idx] = True

%timeit x[idx]
# 51.6 µs ± 2.02 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit x[bool_mask]
# 1.03 ms ± 37.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


# ~ every 50th element
idx = np.random.randint(0, 1000000, size=int(1000000/50))  # changed the ratio!
bool_mask = np.zeros(x.shape, dtype=np.bool)
bool_mask[idx] = True

%timeit x[idx]
# 1.46 ms ± 55.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit x[bool_mask]
# 2.69 ms ± 154 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


# ~ every 5th element
idx = np.random.randint(0, 1000000, size=int(1000000/5))  # changed the ratio!
bool_mask = np.zeros(x.shape, dtype=np.bool)
bool_mask[idx] = True

%timeit x[idx]
# 14.9 ms ± 495 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit x[bool_mask]
# 8.31 ms ± 181 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

那么这里发生了什么？很简单：整数数组索引只需要访问与索引数组中的值一样多的元素。这意味着如果匹配很少，它会很快，但如果有很多索引，它会很慢。然而，布尔数组索引总是需要遍历整个布尔数组并检查“真”值。这意味着它应该是数组的大致“常数”。

但是，等等，对于布尔数组来说，它并不是真正恒定的，为什么整数数组索引需要比布尔数组索引更长的时间（最后一种情况），即使它必须处理大约 5 倍的元素？

这就是它变得更加复杂的地方。在这种情况下，布尔数组在随机位置有True，这意味着它将受到分支预测失败的影响。如果True 和False 的出现次数相同，但出现在随机位置，则这些可能性更大。这就是布尔数组索引变慢的原因——因为True 与False 的比率变得更加平等，因此更加“随机”。如果Trues 越多，结果数组也会越大，这也会消耗更多时间。

作为这个分支预测的例子，使用这个作为例子（可能因不同的系统/编译器而异）：

bool_mask = np.zeros(x.shape, dtype=np.bool)
bool_mask[:1000000//2] = True   # first half True, second half False
%timeit x[bool_mask]
# 5.92 ms ± 118 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

bool_mask = np.zeros(x.shape, dtype=np.bool)
bool_mask[::2] = True   # True and False alternating
%timeit x[bool_mask]
# 16.6 ms ± 361 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

bool_mask = np.zeros(x.shape, dtype=np.bool)
bool_mask[::2] = True
np.random.shuffle(bool_mask)  # shuffled
%timeit x[bool_mask]
# 18.2 ms ± 325 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

因此True 和False 的分布将严重影响带有布尔掩码的运行时，即使它们包含相同数量的Trues！ compress-functions 也可以看到同样的效果。

对于整数数组索引（同样np.take），另一个效果将是可见的：cache locality。您的情况下的索引是随机分布的，因此您的计算机必须执行大量“RAM”来“处理器缓存”加载，因为两个索引不太可能彼此靠近。

比较一下：

idx = np.random.randint(0, 1000000, size=int(1000000/5))
%timeit x[idx]
# 15.6 ms ± 703 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

idx = np.random.randint(0, 1000000, size=int(1000000/5))
idx = np.sort(idx)  # sort them
%timeit x[idx]
# 4.33 ms ± 366 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

通过对索引进行排序，下一个值已经在缓存中的机会大大增加，这可以带来巨大的加速。如果您知道索引将被排序，这是一个非常重要的因素（例如，如果它们是由 np.where 创建的，它们会被排序，这使得 np.where 的结果对于索引特别有效）。

因此，对于小型数组而言，整数数组索引并不慢，而对于大型数组而言，它更快，这取决于更多的因素。两者都有自己的用例，根据具体情况，一个可能（相当）比另一个快。

让我也谈谈 numba 函数。首先是一些一般性的陈述：

cache 不会有什么不同，它只是避免重新编译函数。在交互式环境中，这基本上是无用的。不过，如果您将函数打包在一个模块中会更快。
nogil 本身不会提供任何速度提升。如果在不同的线程中调用会更快，因为每个函数执行都可以释放 GIL，然后多个调用可以并行运行。

否则我不知道 numba 如何有效地实现这些功能，但是当您在 numba 中使用 NumPy 功能时，它可能会更慢或更快 - 但即使它更快，也不会更快（可能对于小型数组除外） .因为如果它可以变得更快，NumPy 开发人员也会实现它。我的经验法则是：如果你可以用 NumPy 做到这一点（矢量化），就不要打扰 numba。只有当你不能使用向量化的 NumPy 函数或者 NumPy 会使用太多的临时数组时，numba 才会发光！

【讨论】：

非常感谢您的解释和付出的努力！最后，我的代码中有一个案例，它受到分支预测失败的强烈影响。 :) 由于与数组大小相比，我大约 80% 的索引数组非常稀疏并已排序，因此我将坚持使用 take 或整数数组索引。其他 20% 的大小几乎与要索引且未排序的数组大小相同，因此我将使用布尔值。我刚刚在我的用例中对其进行了测试，这似乎是最好的方法。 :)
并缓存和 nogil：我的大多数 numba 函数都打包在一个模块中，因此 cache=True 是我的默认选项，因为我打算选择 parallel=True 选项，所以我尝试提前使我的所有功能nogil-兼容。但是不知道cache的真实效果，谢谢解释！我还有点不清楚：是否可以为整数索引数组预定义像 strides 这样的内存访问模式，以便在需要时快速访问 numpy 数组的内存？
Puh，strides ...据我了解，您需要一些模式来处理 strides（简单地使用单个项目偏移量可能不会，但您可以加快速度）。对不起，我之前没有看到问题的更新（对不起，我昨天甚至编辑了它的一些部分）。我认为 strides 解决方案或更快的解决方案取决于其他因素：您是否连续多次使用相同的布尔掩码或索引数组？
@Scotty1- 使用带有 numba 的 parallel=True 参数时要小心。我经常回答出错或无效的问题：stackoverflow.com/questions/35459065、stackoverflow.com/questions/46009368、stackoverflow.com/questions/45610292
是的，目前parallel=True 只给了我大约 20% 的小幅加速（但不适用于索引...对于我的其他计算，包括一些索引，但主要是数组操作）。它还与cache=True 发生冲突，所以我必须分析一下，如果在模块中打包它实际上并没有减慢我的代码速度......添加到我最初的问题中是微不足道的。是的，我的掩码/索引数组只定义一次，并在一次迭代中使用了数百万次。