Fastest way to get the min index of 1D array value in another 1D bin array: answers

【Question Title】: Fastest way to get the min index of 1D array value in another 1D bin array
【Posted】: 2021-09-03 22:46:33
【Question Description】:

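I want to find which bin of a 1D array called bin each element of another 1D array called value falls into, and then compute the minimum index of the values in each bin.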

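The whole procedure: for each bin i, select the values v with bin[i] <= v < bin[i+1], then take the smallest index of those values in value.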

This is my current method:

import numpy as np

bin = np.array([1, 2, 3, 4])
value = np.array([1.2, 1.3, 2.1, 3.1])
extend_bin = np.append(bin[1:], np.iinfo('int').max)
mask = (bin[:, None] <= value[None, :]) & (value[None, :] < extend_bin[:, None])
res = np.argmax(mask, axis=-1)[:-1]

However, when the two 1D arrays are long, the 2D mask array becomes so large that I get a memory error:

import random
import numpy as np

length = int(4e6)

a = np.random.rand(length)
order = np.argsort(a)
bin = a[order]
value = np.random.rand(length)
random.shuffle(value)
extend_bin = np.append(bin[1:], np.iinfo('int').max)
mask = (a[:, None] <= value[None, :]) & (value[None, :] < extend_bin[:, None])
res = np.argmax(mask, axis=-1)[:-1]

Memory error:

    mask = (a[:, None] <= value[None, :]) & (value[None, :] < extend_bin[:, None])
numpy.core._exceptions.MemoryError: Unable to allocate 14.6 TiB for an array with shape (4000000, 4000000) and data type bool
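
As a rough check of the figure in the error message, the size of the mask can be estimated by hand. This sketch only assumes that a NumPy bool element occupies one byte; the variable names are illustrative.

```python
# Estimate the memory needed for a (length, length) boolean mask.
length = int(4e6)

# One byte per bool element, so the byte count equals the element count.
n_bytes = length * length

# Convert bytes to tebibytes (1 TiB = 2**40 bytes).
tib = n_bytes / 2**40
print(f"{tib:.1f} TiB")  # 14.6 TiB
```

The cost grows with the square of the array length, which is why the broadcasting approach stops being practical long before 4 million elements.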

Is there a simpler and more efficient way to handle this?

【Question Discussion】:

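How about using Pandas cut or qcut? pandas.pydata.org/pandas-docs/stable/reference/api/…
@chatax Thanks! I tested pd.cut(df['value'], bins=bin, right=False) and pd.cut(df['value'], bins=bin, right=False, labels=False). I just found that labels=None slows things down a lot! Could you post your answer and compare the speed? I think it would be interesting for other users!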

Tags: python arrays numpy dask

【Solution 1】:

Similar to the pd.cut suggested in the comments, you can use numpy.digitize:

import numpy as np
bin = np.array([1, 2, 3, 4])
value = np.array([1.2, 1.3, 2.1, 3.1])
extend_bin = np.append(bin[1:], np.iinfo('int').max)
binned_values = np.digitize(value, extend_bin)
# The above returns [0, 0, 1, 2]
_, res = np.unique(binned_values, return_index=True)
# res equals [0, 2, 3], i.e. the first index where each value happens
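
One thing to keep in mind: np.unique only reports bins that contain at least one value, so the result has no entry for an empty bin. If one slot per bin is needed, a possible variant is sketched below. The function name min_index_per_bin, the parameter name edges, and the use of -1 to mark empty bins are all assumptions made for illustration; they are not part of the original answer.

```python
import numpy as np

def min_index_per_bin(value, edges):
    """Smallest index in `value` per bin [edges[i], edges[i+1]); -1 if the bin is empty.

    Hypothetical helper: assumes `edges` is sorted in ascending order.
    """
    value = np.asarray(value)
    edges = np.asarray(edges)
    n_bins = len(edges) - 1

    # Bin number of each value; values outside [edges[0], edges[-1]) fall out of range.
    bin_of = np.searchsorted(edges, value, side="right") - 1
    inside = (bin_of >= 0) & (bin_of < n_bins)
    positions = np.flatnonzero(inside)

    # Start every bin at a sentinel larger than any real index,
    # then take an unbuffered element-wise minimum per bin.
    sentinel = len(value)
    res = np.full(n_bins, sentinel, dtype=np.intp)
    np.minimum.at(res, bin_of[inside], positions)

    # Bins that were never touched are empty.
    res[res == sentinel] = -1
    return res

print(min_index_per_bin([1.2, 1.3, 2.1, 3.1], [1, 2, 3, 4]))  # [0 2 3]
```

Memory use stays linear in the input length, since no 2D mask is built.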

This also works for the scaled-up case:

from time import time
import numpy as np
length = int(4e6)
a = np.random.rand(length)
bin = np.concatenate([[0], np.sort(a), [1]])
value = np.random.rand(length)
tstart = time()
binned_values = np.digitize(value, bin)
print(time() - tstart)
# On my machine, the above takes about 4 seconds
tstart = time()
_, res = np.unique(binned_values, return_index=True)
print(time() - tstart)
# And this takes less than one second
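
To gain some confidence that the digitize-based result matches the original mask-based definition, the two can be compared on an input small enough for the 2D mask to fit in memory. This is only a sanity-check sketch: the size n, the seed, and the names edges, occupied and brute are illustrative choices, not from the original post.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200  # small enough that an (n + 1, n) mask is cheap
edges = np.concatenate([[0.0], np.sort(rng.random(n)), [1.0]])
value = rng.random(n)

# Fast path: digitize returns k + 1 for a value in [edges[k], edges[k + 1]).
binned = np.digitize(value, edges)
occupied, first_idx = np.unique(binned, return_index=True)

# Brute force: one mask row per bin, as in the question.
mask = (edges[:-1, None] <= value[None, :]) & (value[None, :] < edges[1:, None])
has_value = mask.any(axis=1)
brute = np.argmax(mask, axis=1)[has_value]

# Both methods should agree on which bins are occupied and on the first index in each.
print(np.array_equal(occupied, np.flatnonzero(has_value) + 1))
print(np.array_equal(first_idx, brute))
```

Empty bins are filtered out of the brute-force result before comparing, because np.unique has no entry for them.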

【Discussion】: