In-place numpy array sorting according to given index: answers

【Question title】: In-place numpy array sorting according to given index
【Posted】: 2014-10-07 15:34:23
【Question】:

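There are some questions that come close, but I have not found a specific answer to this. I am trying to do some in-place sorting of a 3D numpy array along a given axis. I do not want a simple sort, though; I want to reorder the array according to my own index. For example: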

a = np.random.rand(3, 3, 3)

Say I want to reorder the last dimension according to the following indices of the old array:

new_order = [1,2,0]

I would expect to be able to say:

a[:,:,new_order] = a

But this does not behave as expected. Any suggestions?

【Comments】:

Are you looking for a = a[:, :, new_order]?
It has been so long that I forgot you could do this in Matlab.

Tags: python numpy

【Solution 1】:

np.ndarray.sort is the only sort that claims to be in-place, and it does not give you much control.

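Placing the order index on the left-hand side works, but it can give unpredictable results. Evidently it performs some sort of sequential assignment, so an earlier assignment on the left can affect the values still to be read on the right.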

In [719]: a=np.arange(12).reshape(3,4)
In [720]: a[:,[0,1,3,2]]=a
In [721]: a
Out[721]: 
array([[ 0,  1,  2,  2],
       [ 4,  5,  6,  6],
       [ 8,  9, 10, 10]])

Doing this kind of assignment predictably requires some sort of buffering.

In [728]: a[:,[0,1,3,2]]=a.copy()
In [729]: a
Out[729]: 
array([[ 0,  1,  3,  2],
       [ 4,  5,  7,  6],
       [ 8,  9, 11, 10]])

Indexing on the right-hand side gets around this problem, but it is not in-place. The variable a now points to a new object.

In [731]: a=a[:,[0,1,3,2]]
In [732]: a
Out[732]: 
array([[ 0,  1,  3,  2],
       [ 4,  5,  7,  6],
       [ 8,  9, 11, 10]])
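
One point worth checking here: the two forms shown so far (index on the left with a copy, index on the right) give the same result for [0,1,3,2] only because that permutation is its own inverse. A minimal sketch, using the question's new_order = [1,2,0] on a small made-up array, suggests that in general the left-hand form applies the inverse permutation:

```python
import numpy as np

new_order = [1, 2, 0]
a = np.arange(6).reshape(2, 3)      # small illustrative array

# Index on the right: column j of the result is old column new_order[j].
gathered = a[:, new_order]

# Index on the left (with a copy as buffer): old column j lands in
# column new_order[j] of the result.
scattered = a.copy()
scattered[:, new_order] = a.copy()

# The left-hand form matches right-hand indexing with the inverse permutation.
inverse = np.argsort(new_order)
print(gathered)
print(scattered)
print(np.array_equal(scattered, a[:, inverse]))
```

So if the intent is "new column j is old column new_order[j]", the index belongs on the right-hand side.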

However, assigning to a[:] may solve this:

In [738]: a=np.arange(12).reshape(3,4)
In [739]: a.__array_interface__
Out[739]: 
{'data': (181868592, False),   # 181... is the id of the data buffer
 'descr': [('', '<i4')],
 'shape': (3, 4),
 'strides': None,
 'typestr': '<i4',
 'version': 3}
In [740]: a[:]=a[:,[0,1,3,2]]
In [741]: a.__array_interface__
Out[741]: 
{'data': (181868592, False),  # same data buffer
 'descr': [('', '<i4')],
 'shape': (3, 4),
 'strides': None,
 'typestr': '<i4',
 'version': 3}
In [742]: a
Out[742]: 
array([[ 0,  1,  3,  2],
       [ 4,  5,  7,  6],
       [ 8,  9, 11, 10]])

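The fact that the data buffer address in __array_interface__ is the same indicates that this is an in-place operation, in the sense that a keeps its buffer. But it would be good to test this with other indices to make sure it does what you want.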
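
The same check can be scripted rather than read off by eye. The following is only a sketch; the variable names are made up for illustration, and np.shares_memory is used as an alternative way of asking whether two arrays overlap in memory:

```python
import numpy as np

# Case 1: assign through a[:]; the array object and its buffer are reused.
a = np.arange(12).reshape(3, 4)
addr_before = a.__array_interface__['data'][0]
a[:] = a[:, [0, 1, 3, 2]]
addr_after = a.__array_interface__['data'][0]
same_buffer = addr_before == addr_after

# Case 2: rebind the name; fancy indexing returns a new array with new data.
c = np.arange(12).reshape(3, 4)
old_c = c
c = c[:, [0, 1, 3, 2]]
rebound_shares = np.shares_memory(old_c, c)

print(same_buffer, rebound_shares)
```

This only shows that the destination buffer is reused; it says nothing about temporary memory used while evaluating the right-hand side.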

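But is "in-place" sorting necessary? If the array is very large, it might be needed to avoid memory errors. But we would have to test the alternatives to see whether they work.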

In-place also matters if some other variable uses the same data. For example:

b = a.T # a transpose

With a[:] =, the rows of b are reordered as well; a and b continue to share the same data. With a =, b is unchanged; a and b are now decoupled.
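
A short sketch of that difference, using the same 3x4 array as above (the names a2 and b2 are only there to keep the two cases apart):

```python
import numpy as np

# a[:] = ... writes into the shared buffer, so the transposed view follows.
a = np.arange(12).reshape(3, 4)
b = a.T
a[:] = a[:, [0, 1, 3, 2]]
print(b[2], b[3])                  # rows 2 and 3 of b have swapped

# a = ... rebinds the name to a new array; the view keeps the old data.
a2 = np.arange(12).reshape(3, 4)
b2 = a2.T
a2 = a2[:, [0, 1, 3, 2]]
print(b2[2], b2[3])                # unchanged
print(np.shares_memory(a, b), np.shares_memory(a2, b2))
```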

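【Comments】:

It looks like a[:] = a[:, [0,1,3,2]] is a viable solution (it produces the correct result), but it is not an in-place operation. Replicating it with a large array and logging memory usage confirms that numpy copies the data into a temporary buffer before writing it back into the memory of a.
@Graham501617, this being Python, the RHS is evaluated first and then passed to a.__setitem__(slice(None), res). So yes, there is some buffering. It does not select and replace values element by element, or even column by column. But it is in-place in the sense that a retains its original data buffer. The np.add.at documentation gives a better idea of what an unbuffered operation looks like.
@Graham501617, more on buffering, this time with the ufunc out parameter: stackoverflow.com/questions/70294788/…
a[:] = a[indices] takes a copy, so it does not act on a in place. It is effectively doing x = a[indices]; a[:] = x; del x.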
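
To illustrate the np.add.at remark in the comments above, here is a small sketch of buffered versus unbuffered behaviour with a repeated index. The arrays and the index list are made up for the example:

```python
import numpy as np

idx = [0, 0, 1]                      # index 0 appears twice

# Buffered: the right-hand side is computed once from the old values,
# so the repeated index is only incremented once.
buffered = np.zeros(3, dtype=int)
buffered[idx] += 1

# Unbuffered: np.add.at applies the addition once per index entry.
unbuffered = np.zeros(3, dtype=int)
np.add.at(unbuffered, idx, 1)

print(buffered)                      # [1 1 0]
print(unbuffered)                    # [2 1 0]
```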

【Solution 2】:

Unfortunately, numpy does not have a built-in solution for this. The only way is either to use some clever assignments or to write your own custom method.

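Using cycle detection, an additional set for remembering visited indices, and an auxiliary array for caching one slice along the axis, I wrote a custom method for this that should be useful for reordering large ndarrays: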

import numpy as np

def put_at(index, axis=-1, slc=(slice(None),)):
    """Gets the numpy indexer for the given index based on the axis."""
    return (axis < 0)*(Ellipsis,) + axis*slc + (index,) + (-1-axis)*slc


def reorder_inplace(array, new_order, axis=0):
    """
    Reindex (reorder) the array along an axis.

    :param array: The array to reindex.
    :param new_order: A list with the new index order. Must be a valid permutation.
    :param axis: The axis to reindex.
    """
    if np.size(array, axis=axis) != len(new_order):
        raise ValueError(
            'The new order did not match indexed array along dimension {0}; '
            'dimension is {1} but the new order has length {2}'.format(
                axis, np.size(array, axis=axis), len(new_order)
            )
        )

    visited = set()
    for index, source in enumerate(new_order):
        if index not in visited and index != source:
            initial_values = np.take(array, index, axis=axis).copy()

            destination = index
            visited.add(destination)
            while source != index:
                if source in visited:
                    raise IndexError(
                        'The new order is not unique; '
                        'duplicate found at position {0} with value {1}'.format(
                            destination, source
                        )
                    )

                array[put_at(destination, axis=axis)] = array.take(source, axis=axis)

                destination = source
                source = new_order[destination]

                visited.add(destination)
            array[put_at(destination, axis=axis)] = initial_values
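
The cycle-following idea may be easier to see in one dimension. The sketch below is not the function above; permute_1d_inplace is a hypothetical, simplified helper written for illustration. It validates the permutation up front instead of detecting duplicates inside the loop, and it caches a single element per cycle rather than a whole slice:

```python
import numpy as np

def permute_1d_inplace(values, new_order):
    """Hypothetical 1D helper: afterwards values[i] == old_values[new_order[i]]."""
    n = len(new_order)
    if sorted(new_order) != list(range(n)) or len(values) != n:
        raise ValueError('new_order must be a permutation of range(len(values))')

    visited = [False] * n
    for start in range(n):
        if visited[start]:
            continue
        saved = values[start]          # the only element cached for this cycle
        dest = start
        src = new_order[dest]
        visited[dest] = True
        while src != start:
            values[dest] = values[src]  # pull the next element of the cycle
            dest = src
            visited[dest] = True
            src = new_order[dest]
        values[dest] = saved            # close the cycle with the cached element

data = np.array([10, 20, 30, 40, 50])
order = [3, 2, 0, 4, 1]
expected = data[order]                  # reference result from fancy indexing
permute_1d_inplace(data, order)
print(data, np.array_equal(data, expected))
```

Each cycle of the permutation is walked exactly once, so every element is moved once and the extra storage is one element (one slice in the N-dimensional version) plus the visited bookkeeping.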

Example:

In[4]: a = np.arange(15).reshape(3, 5)
In[5]: a
Out[5]: 
array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])

Reordering along axis 0:

In[6]: reorder_inplace(a, [2, 0, 1], axis=0)
In[7]: a
Out[7]: 
array([[10, 11, 12, 13, 14],
       [ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9]])

Reordering along axis 1:

In[10]: reorder_inplace(a, [3, 2, 0, 4, 1], axis=1)
In[11]: a
Out[11]: 
array([[ 3,  2,  0,  4,  1],
       [ 8,  7,  5,  9,  6],
       [13, 12, 10, 14, 11]])

Timing and memory for a small 1000 x 1000 array

In[5]: a = np.arange(1000 * 1000).reshape(1000, 1000)
In[6]: %timeit reorder_inplace(a, np.random.permutation(1000))
8.19 ms ± 18.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In[7]: %memit reorder_inplace(a, np.random.permutation(1000))
peak memory: 81.75 MiB, increment: 0.49 MiB
In[8]: %timeit a[:] = a[np.random.permutation(1000), :]
3.27 ms ± 9.49 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In[9]: %memit a[:] = a[np.random.permutation(1000), :]
peak memory: 89.56 MiB, increment: 0.01 MiB

For small arrays the memory consumption is not very different, but the numpy version is much faster (3.27 ms versus 8.19 ms).

Timing and memory for a 20000 x 20000 array

In[5]: a = np.arange(20000 * 20000).reshape(20000, 20000)
In[6]: %timeit reorder_inplace(a, np.random.permutation(20000))
1.16 s ± 1.39 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In[7]: %memit reorder_inplace(a, np.random.permutation(20000))
peak memory: 3130.77 MiB, increment: 0.19 MiB
In[8]: %timeit a[:] = a[np.random.permutation(20000), :]
1.84 s ± 2.26 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In[9]: %memit a[:] = a[np.random.permutation(20000), :]
peak memory: 6182.80 MiB, increment: 3051.76 MiB

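When the array grows to this size, the numpy version becomes the slower of the two (1.84 s versus 1.16 s). Its memory consumption is also very high, with an increment of about 3 GiB for the temporary copy, while the custom in-place reordering uses a negligible amount of extra memory.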

【Comments】:

【Solution 3】:

Here you go:

a = a[:, :, new_order]

Also, here are a couple of "numpy for Matlab users" pages that I found useful when I was getting started:

http://wiki.scipy.org/NumPy_for_Matlab_Users

http://mathesaurus.sourceforge.net/matlab-numpy.html

【Comments】:

But this is not "in-place". b = a[:,:,new_order] does the same thing, except that the old a array is then not free to be garbage collected.