Looking for the fastest method to downsample a 3D array based on occurrences using numpy: answers

【Question title】: Looking for the fastest method to downsample a 3D array based on occurrences using numpy
【Posted】: 2020-03-21 23:32:24
【Question description】:

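Given a large 3D numpy array of type 'uint8' (not too large to fit in memory), I want to scale it down by a given scale factor in each dimension. You may assume that the shape of the array is divisible by the scale factor.

The values of the array are in [0, 1, ... max], where max is always smaller than 6. I want to scale it down so that each 3D block of shape "scale_factor" returns the number that occurs most often in that block. On a tie, return the first one (I don't care which).

I have tried the following, which works: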

import numpy as np

array = np.random.randint(0, 4, ((128, 128, 128)), dtype='uint8')
scale_factor = (4, 4, 4)
bincount = 4  # number of distinct values (0..3); every bincount result must have the same length

# Reshape to free dimension of size scale_factor to apply scaledown method to
m, n, r = np.array(array.shape) // scale_factor
array = array.reshape((m, scale_factor[0], n, scale_factor[1], r, scale_factor[2]))


# Making histogram, first over last axis, then sum over other two
array = np.apply_along_axis(lambda x: np.bincount(x, minlength=bincount),
                            axis=5, arr=array)
array = np.apply_along_axis(lambda x: np.sum(x), axis=3, arr=array)
array = np.apply_along_axis(lambda x: np.sum(x), axis=1, arr=array).astype('uint8')

array = np.argmax(array , axis=3)

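This works, but bincount is very slow. I also got np.histogram to work, but it is very slow as well. I think neither of the methods I tried is really designed for my purpose: they offer many more features, and those features slow them down.

My question is: does anyone know a faster method? I would also be happy if someone could point me to a method from a deep learning library that does this, but that is not officially the question.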

【Comments】:

apply_along_axis is, as I understand it, a fancy Python for loop.
What I mean is that it is not bincount that is slowing you down.
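
To illustrate the comments above: np.apply_along_axis calls the supplied Python function once per 1-D slice, so the per-call overhead is paid for every slice. A minimal sketch on a small array; the array size and the helper name counting_bincount are only for illustration and are not from the question:

```python
import numpy as np

rng = np.random.default_rng(0)
small = rng.integers(0, 4, size=(8, 8, 8), dtype=np.uint8)

calls = []  # one entry per Python-level call

def counting_bincount(x):
    # Hypothetical helper: same work as the lambda in the question, plus bookkeeping
    calls.append(1)
    return np.bincount(x, minlength=4)

hist = np.apply_along_axis(counting_bincount, axis=2, arr=small)

print(len(calls))   # 64: one call for each of the 8 * 8 slices along the last axis
print(hist.shape)   # (8, 8, 4)
```

For the reshaped array in the question (32, 4, 32, 4, 32, 4), the same reasoning gives 32 * 4 * 32 * 4 * 32 = 524,288 Python-level calls along the last axis.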

Tags: python numpy numpy-ndarray downsampling

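【Solution 1】:

@F.Wessels is thinking in the right direction, but the answer is not quite there yet. If you do the counting yourself instead of using apply_along_axis, the speed can improve by more than a hundredfold.

First, when you divide the 3D array space into blocks, your dimensions go from 128x128x128 to 32x4x32x4x32x4. Try to understand this intuitively: you effectively have 32x32x32 blocks, each of size 4x4x4. Rather than keeping the blocks as 4x4x4, you can flatten each one into 64 values, from which the most frequent item can be found. Here is the trick: it also does not matter if your blocks are arranged as 32768x64 rather than 32x32x32x64. Basically, we are back to two dimensions, where everything becomes easier.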
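
The block layout described above can be checked on a small array. This is a minimal sketch, assuming an 8x8x8 input with 4x4x4 blocks; it uses a single transpose in place of the chained swapaxes calls in the answer's code, which is a substitution made here for readability:

```python
import numpy as np

array = np.arange(8 * 8 * 8).reshape(8, 8, 8)
scale_factor = (4, 4, 4)
sx, sy, sz = scale_factor
m, n, r = (dim // s for dim, s in zip(array.shape, scale_factor))

# Split each axis into (block index, offset within block)
split = array.reshape(m, sx, n, sy, r, sz)
# Move the three block-index axes to the front and the offsets to the back
blocks = split.transpose(0, 2, 4, 1, 3, 5)
# One row per block, one column per element of that block
flat = blocks.reshape(m * n * r, sx * sy * sz)

print(flat.shape)  # (8, 64)

# The row for block (i, j, k) holds exactly the values of that sub-cube
i, j, k = 1, 0, 1
row = flat[(i * n + j) * r + k]
print(np.array_equal(row, array[4:8, 0:4, 4:8].ravel()))  # True
```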

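Now that you have a 2D array of size 32768x64, you can do the bin counting yourself with a list comprehension and numpy operations; it will be well over 10 times faster (see the timings in the output).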

import time
import numpy as np

array = np.random.randint(0, 4, ((128, 128, 128)), dtype='uint8')
scale_factor = (4, 4, 4)
bincount = 4

def prev_func(array):
    # Reshape to free dimension of size scale_factor to apply scaledown method to
    m, n, r = np.array(array.shape) // scale_factor
    arr = array.reshape((m, scale_factor[0], n, scale_factor[1], r, scale_factor[2]))
    arr = np.swapaxes(arr, 1, 2).swapaxes(2, 4)
    arr = arr.reshape((m, n, r, np.prod(scale_factor)))
    # Obtain the element that occurred the most
    arr = np.apply_along_axis(lambda x: max(set(x), key=lambda y: list(x).count(y)),
                              axis=3, arr=arr)
    return arr

def new_func(array):
    # Reshape to free dimension of size scale_factor to apply scaledown method to
    m, n, r = np.array(array.shape) // scale_factor
    arr = array.reshape((m, scale_factor[0], n, scale_factor[1], r, scale_factor[2]))
    arr = np.swapaxes(arr, 1, 2).swapaxes(2, 4)
    arr = arr.reshape((m, n, r, np.prod(scale_factor)))
    # Collapse dimensions
    arr = arr.reshape(-1,np.prod(scale_factor))
    # Get blockwise frequencies -> Get most frequent items
    arr = np.array([(arr==b).sum(axis=1) for b in range(bincount)]).argmax(axis=0)
    arr = arr.reshape((m,n,r))
    return arr

N = 10

start1 = time.time()
for i in range(N):
    out1 = prev_func(array)
end1 = time.time()
print('Prev:',(end1-start1)/N)

start2 = time.time()
for i in range(N):
    out2 = new_func(array)
end2 = time.time()
print('New:',(end2-start2)/N)

# Count mismatching voxels (a plain sum of signed differences could cancel out)
print('Difference:', np.count_nonzero(out1 != out2))

Output:

Prev: 1.4244404077529906
New: 0.01667332649230957
Difference: 0

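As you can see, the results do not differ, even though I rearranged the dimensions. Numpy's reshape keeps the order of the values when I go to 2D, because the last dimension of 64 is preserved. When I reshape back to m,n,r, that order is re-established. The original method you provided takes about 5 seconds to run on my machine, so empirically the speedup is several hundred times (about 5 s versus roughly 0.017 s).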
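
As a sanity check of the claim that block order survives the round trip through 2D, here is a small sketch with a known answer. The helper block_mode and the test data are illustrative and not part of the original answer: each label is blown up into a 2x2x2 block, one voxel per block is corrupted, and the downsampled result should give the labels back in their original positions.

```python
import numpy as np

def block_mode(array, scale_factor, n_values):
    """Most frequent value per block (illustrative helper, not from the answer)."""
    sx, sy, sz = scale_factor
    m, n, r = (dim // s for dim, s in zip(array.shape, scale_factor))
    flat = (array.reshape(m, sx, n, sy, r, sz)
                 .transpose(0, 2, 4, 1, 3, 5)
                 .reshape(m * n * r, sx * sy * sz))
    # One comparison pass per possible value, then pick the value with the highest count
    counts = np.stack([(flat == v).sum(axis=1) for v in range(n_values)])
    return counts.argmax(axis=0).reshape(m, n, r)

labels = np.array([[[0, 1], [2, 3]], [[3, 2], [1, 0]]], dtype=np.uint8)
# Blow every label up into a 2x2x2 block
array = labels.repeat(2, axis=0).repeat(2, axis=1).repeat(2, axis=2)
# Corrupt the corner voxel of each block so 7 of 8 voxels still carry the label
array[::2, ::2, ::2] = (array[::2, ::2, ::2] + 1) % 4

out = block_mode(array, (2, 2, 2), n_values=4)
print(np.array_equal(out, labels))  # True
```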

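【Comments】:

counts = np.zeros((len(arr), bincount), dtype=np.intp); np.add.at(counts.reshape(-1), (arr + np.arange(len(arr))[:, None] * bincount).flatten(), 1); return counts.argmax(axis=1)
The above makes only a single alloc of size 128,bincount, thereby avoiding O(bincount**2) complexity. For a small bincount it may not be a big deal, but it could add up quickly.
Awesome. This is 15 times faster than applying scipy.stats.mode to the same thing. Not entirely surprising for a small number of possibilities.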
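
The offset idea from the first comment can also be written with a single np.bincount call instead of np.add.at. This is a sketch under assumptions, not the commenter's code: flat stands in for the collapsed (blocks x 64) array, n_values plays the role of bincount, and whether it is actually faster than the per-value comparison would need to be measured on your data.

```python
import numpy as np

rng = np.random.default_rng(0)
n_values = 4
# Stand-in for the collapsed array: one row per block, 64 values per block
flat = rng.integers(0, n_values, size=(1000, 64), dtype=np.uint8)

# Give every row its own range of bin ids: row i uses [i * n_values, (i + 1) * n_values)
offsets = np.arange(flat.shape[0])[:, None] * n_values
keys = (flat.astype(np.intp) + offsets).ravel()

# One histogram over all rows at once, then view it as (rows, n_values)
counts = np.bincount(keys, minlength=flat.shape[0] * n_values)
counts = counts.reshape(flat.shape[0], n_values)
mode_offset = counts.argmax(axis=1)

# Reference: one comparison pass per value, as in new_func
mode_compare = np.stack([(flat == v).sum(axis=1) for v in range(n_values)]).argmax(axis=0)
print(np.array_equal(mode_offset, mode_compare))  # True
```

Both variants break ties the same way, because argmax returns the first (smallest) value with the highest count.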

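【Solution 2】:

Well, here is a similar approach, but faster. It only replaces the bincount function with a simpler implementation suited to your use case: lambda x: max(set(x), key=lambda y: list(x).count(y)), where the array is first reshaped so that the method can be applied directly along a single dimension.

On my laptop, for a 128x128x128 array, it is about 4 times faster: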

import time
import numpy as np

array = np.random.randint(0, 4, ((128, 128, 128)), dtype='uint8')
scale_factor = (4, 4, 4)
bincount = 4

start_time = time.time()
N = 10
for i in range(N):

    # Reshape to free dimension of size scale_factor to apply scaledown method to
    m, n, r = np.array(array.shape) // scale_factor
    arr = array.reshape((m, scale_factor[0], n, scale_factor[1], r, scale_factor[2]))
    arr = np.swapaxes(arr, 1, 2).swapaxes(2, 4)
    arr = arr.reshape((m, n, r, np.prod(scale_factor)))

    # Obtain the element that occurred the most
    arr = np.apply_along_axis(lambda x: max(set(x), key=lambda y: list(x).count(y)),
                              axis=3, arr=arr)

print((time.time() - start_time) / N)

There is still a large gap compared with built-in methods such as np.mean().
