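Unfortunately, numpy has no built-in solution for this problem. The only options are to use some clever assignments or to write your own custom method.
Using cycle detection, an additional set for remembering visited indices, and an auxiliary array for caching one slice along the axis, I wrote a custom method for this, which should be useful for reordering large ndarrays: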
import numpy as np


def put_at(index, axis=-1, slc=(slice(None),)):
    """Gets the numpy indexer for the given index based on the axis."""
    return (axis < 0)*(Ellipsis,) + axis*slc + (index,) + (-1-axis)*slc


def reorder_inplace(array, new_order, axis=0):
    """
    Reindex (reorder) the array along an axis.

    :param array: The array to reindex.
    :param new_order: A list with the new index order. Must be a valid permutation.
    :param axis: The axis to reindex.
    """
    if np.size(array, axis=axis) != len(new_order):
        raise ValueError(
            'The new order did not match indexed array along dimension {0}; '
            'dimension is {1} but the new order has length {2}'.format(
                axis, np.size(array, axis=axis), len(new_order)
            )
        )

    visited = set()
    for index, source in enumerate(new_order):
        if index not in visited and index != source:
            # Start of a new cycle: cache the slice that is overwritten first.
            initial_values = np.take(array, index, axis=axis).copy()

            destination = index
            visited.add(destination)
            while source != index:
                if source in visited:
                    raise IndexError(
                        'The new order is not unique; '
                        'duplicate found at position {0} with value {1}'.format(
                            destination, source
                        )
                    )
                # Move the source slice into its destination, then follow the cycle.
                array[put_at(destination, axis=axis)] = array.take(source, axis=axis)

                destination = source
                source = new_order[destination]
                visited.add(destination)

            # Close the cycle with the slice cached at its start.
            array[put_at(destination, axis=axis)] = initial_values
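The one-line body of put_at is fairly dense, so here is a small sketch of the index tuples it builds for a few axes. The helper is repeated only so the snippet runs on its own; the variable names are just for illustration.

```python
def put_at(index, axis=-1, slc=(slice(None),)):
    """Same helper as above, repeated so this snippet is self-contained."""
    return (axis < 0)*(Ellipsis,) + axis*slc + (index,) + (-1-axis)*slc


full = slice(None)  # what a bare ":" means inside square brackets

axis0 = put_at(1, axis=0)          # (1,)                 i.e. array[1]
axis1 = put_at(1, axis=1)          # (full, 1)            i.e. array[:, 1]
last = put_at(1, axis=-1)          # (Ellipsis, 1)        i.e. array[..., 1]
second_last = put_at(1, axis=-2)   # (Ellipsis, 1, full)  i.e. array[..., 1, :]
```

Multiplying a tuple by zero or by a negative number gives an empty tuple, which is why only one of the two slc terms contributes for any given axis.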
Example:
In[4]: a = np.arange(15).reshape(3, 5)
In[5]: a
Out[5]:
array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])
Reordering along axis 0:
In[6]: reorder_inplace(a, [2, 0, 1], axis=0)
In[7]: a
Out[7]:
array([[10, 11, 12, 13, 14],
       [ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9]])
Reordering along axis 1 (starting again from the original a, as the row order in the output shows):
In[10]: reorder_inplace(a, [3, 2, 0, 4, 1], axis=1)
In[11]: a
Out[11]:
array([[ 3,  2,  0,  4,  1],
       [ 8,  7,  5,  9,  6],
       [13, 12, 10, 14, 11]])
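The cycle-following idea is easier to see in one dimension. Below is a minimal sketch, not the method above: permute_1d_inplace is a hypothetical helper that assumes new_order is a valid permutation and skips the duplicate check, so an invalid order could loop forever. It needs only one saved element per cycle, and the result is compared against ordinary fancy indexing.

```python
import numpy as np


def permute_1d_inplace(values, new_order):
    """Hypothetical 1-D helper: afterwards values[i] holds the old values[new_order[i]]."""
    visited = set()
    for start in range(len(new_order)):
        if start in visited or new_order[start] == start:
            continue  # already placed, or a fixed point
        saved = values[start]  # the only temporary: one element per cycle
        dest = start
        visited.add(dest)
        src = new_order[dest]
        while src != start:
            values[dest] = values[src]  # pull the element into its final slot
            dest = src
            visited.add(dest)
            src = new_order[dest]
        values[dest] = saved  # close the cycle
    return values


data = np.array([10, 20, 30, 40, 50])
order = [3, 2, 0, 4, 1]
expected = data[order]  # out-of-place reference, computed before mutating data
permute_1d_inplace(data, order)
print(data)  # [40 30 10 50 20]
```

Here the whole permutation is a single cycle (0 <- 3 <- 4 <- 1 <- 2 <- 0), so exactly one element is ever held aside.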
Timing and memory for a small 1000 x 1000 array
In[5]: a = np.arange(1000 * 1000).reshape(1000, 1000)
In[6]: %timeit reorder_inplace(a, np.random.permutation(1000))
8.19 ms ± 18.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In[7]: %memit reorder_inplace(a, np.random.permutation(1000))
peak memory: 81.75 MiB, increment: 0.49 MiB
In[8]: %timeit a[:] = a[np.random.permutation(1000), :]
3.27 ms ± 9.49 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In[9]: %memit a[:] = a[np.random.permutation(1000), :]
peak memory: 89.56 MiB, increment: 0.01 MiB
For a small array the memory consumption is about the same, but the numpy version is much faster (3.27 ms versus 8.19 ms).
Timing and memory for a 20000 x 20000 array
In[5]: a = np.arange(20000 * 20000).reshape(20000, 20000)
In[6]: %timeit reorder_inplace(a, np.random.permutation(20000))
1.16 s ± 1.39 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In[7]: %memit reorder_inplace(a, np.random.permutation(20000))
peak memory: 3130.77 MiB, increment: 0.19 MiB
In[8]: %timeit a[:] = a[np.random.permutation(20000), :]
1.84 s ± 2.26 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In[9]: %memit a[:] = a[np.random.permutation(20000), :]
peak memory: 6182.80 MiB, increment: 3051.76 MiB
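At this larger size the numpy version becomes the slower one (1.84 s versus 1.16 s). Its memory consumption is also very high, with an increment of 3051.76 MiB, while the custom in-place reorder uses a negligible amount (0.19 MiB).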
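A likely explanation for the memory gap, offered as a sketch rather than something measured here: indexing with an integer array always produces a new array, so a[:] = a[perm, :] first builds a complete temporary. The snippet checks this on a tiny array and then estimates the size of such a temporary for the 20000 x 20000 case. The 8-byte element size is an assumption (64-bit integers, the usual np.arange default on 64-bit Linux and macOS), and full_copy_mib is a hypothetical helper.

```python
import math

import numpy as np

small = np.arange(12).reshape(4, 3)
perm = np.array([2, 0, 3, 1])

# Indexing with an integer array returns a fresh array, not a view.
picked = small[perm, :]
shares = np.shares_memory(small, picked)


def full_copy_mib(shape, itemsize=8):
    """Hypothetical helper: size in MiB of one full copy (itemsize is assumed)."""
    return math.prod(shape) * itemsize / 2**20


temp_mib = full_copy_mib((20000, 20000))
print(shares, round(temp_mib, 2))  # False 3051.76
```

Under that assumption, the estimate agrees with the increment reported by %memit for the numpy version.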