Efficient way to pick first 'n' non-repeating elements in every row of a 2d numpy array: answers

【Question title】: Efficient way to pick first 'n' non-repeating elements in every row of a 2d numpy array
【Posted】: 2021-10-05 16:32:21
【Question】:

I have a 2d numpy array of integers, and I want to pick the first 5 unique elements in every row.

a = np.array([[193,  64,  64, 139, 180, 180, 104, 152,  69,  22, 192,  92],
              [  1,  36, 156, 152, 152,  37,  46, 143, 141, 114,  25, 134],
              [110,  96,  52,  53,  35, 147,   3, 116,  20,  11, 137,   5]])

Note the repeated elements in the first and second rows. Repeated elements always appear next to each other. The output should be:

array([[193, 64, 139, 180, 104], [1, 36, 156, 152, 37], [110, 96, 52, 53, 35]])

This is a sample array; the actual array has 20,000 rows. I am looking for an efficient way to do this without using loops. Thanks in advance.

【Comments】:

Tags: python arrays numpy

【Solution 1】:

Update

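To get rid of the for loops (I used them because the break statement makes the program more efficient), you can use itertools.takewhile() to play the role of the break statement inside a list comprehension, which makes the program more efficient still. I tested both versions of the code, one with itertools.takewhile() and one without, and the former was faster: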

import numpy as np
from itertools import groupby, takewhile

a = np.array([[193,  64,  64, 139, 180, 180, 104, 152,  69,  22, 192,  92],
              [  1,  36, 156, 152, 152,  37,  46, 143, 141, 114,  25, 134],
              [110,  96,  52,  53,  35, 147,   3, 116,  20,  11, 137,   5]])

result = [[k[0] for i, k in takewhile(lambda x: x[0] != 5, enumerate(groupby(row)))] for row in a]
print(np.array(result))

Output:

[[193  64 139 180 104]
 [  1  36 156 152  37]
 [110  96  52  53  35]]

(Using for loops)

You can try using the built-in enumerate() function together with itertools.groupby():

import numpy as np
from itertools import groupby

a = np.array([[193,  64,  64, 139, 180, 180, 104, 152,  69,  22, 192,  92],
              [  1,  36, 156, 152, 152,  37,  46, 143, 141, 114,  25, 134],
              [110,  96,  52,  53,  35, 147,   3, 116,  20,  11, 137,   5]])

def get_unique(a, amt):
    for row in a:
        r = []
        for i, k in enumerate(groupby(row)):
            if i == amt:
                break
            r.append(k[0])
        yield r

for row in get_unique(a, 5):
    print(row)

Output:

[193, 64, 139, 180, 104]
[1, 36, 156, 152, 37]
[110, 96, 52, 53, 35]

Without the function:

import numpy as np
from itertools import groupby

a = np.array([[193,  64,  64, 139, 180, 180, 104, 152,  69,  22, 192,  92],
              [  1,  36, 156, 152, 152,  37,  46, 143, 141, 114,  25, 134],
              [110,  96,  52,  53,  35, 147,   3, 116,  20,  11, 137,   5]])

result = []
for row in a:
    r = []
    for i, k in enumerate(groupby(row)):
        if i == 5:
            break
        r.append(k[0])
    result.append(r)

print(np.array(result))

Output:

[[193  64 139 180 104]
 [  1  36 156 152  37]
 [110  96  52  53  35]]

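【Comments】:

Thanks! But I am looking for a solution that does not use loops.
@ErnestSKirubakaran I have updated the answer to exclude the loops.
The solution without loops gives the best performance. Thanks!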

【Solution 2】:

Try groupby:

>>> import numpy as np
>>> from itertools import groupby
>>> np.array([np.array([k for k, g in groupby(row)])[:5] for row in a])
array([[193,  64, 139, 180, 104],
       [  1,  36, 156, 152,  37],
       [110,  96,  52,  53,  35]])

【Comments】:

Thanks for your answer. %timeit np.array([np.array([k for k, g in groupby(row)])[:5] for row in a]) gives 374 ms ± 3.36 ms per loop (mean ± std. dev. of 7 runs, 1 loop each). I am looking for a more efficient way.
The unedited post said "non-repeating".

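【Solution 3】:

Using numpy alone, you can apply the unique function to every row, but the results need to be padded to a common length and the original order has to be preserved. Then just take the first 5 columns of the result: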

np.apply_along_axis(lambda x: np.pad(u := x[np.sort(np.unique(x, return_index=1)[1])], (0, a[0].size-u.size)), 1, a)[:,:5]

>>> a = np.array([[193,  64,  64, 139, 180, 180, 104, 152,  69,  22, 192,  92], [  1,  36, 156, 152,  152,  37,  46, 143, 141, 114,  25, 134], [110,  96,  52,  53,  35, 147,   3, 116,  20,  11, 137,   5]])
>>> np.apply_along_axis(lambda x: np.pad(u := x[np.sort(np.unique(x, return_index=1)[1])], (0, a[0].size-u.size)), 1, a)[:,:5]
[[193  64 139 180 104]
 [  1  36 156 152  37]
 [110  96  52  53  35]]

【Comments】:

【Solution 4】:

Try numpy.apply_along_axis + itertools.groupby + itertools.islice:

import numpy as np
from itertools import groupby, islice

a = np.array([[193, 64, 64, 139, 180, 180, 104, 152, 69, 22, 192, 92],
              [1, 36, 156, 152, 152, 37, 46, 143, 141, 114, 25, 134],
              [110, 96, 52, 53, 35, 147, 3, 116, 20, 11, 137, 5]])


first_5_unique = lambda x: [k for k, _ in islice(groupby(x), 5)]
res = np.apply_along_axis(first_5_unique, axis=1, arr=a)
print(res)

Output:

[[193  64 139 180 104]
 [  1  36 156 152  37]
 [110  96  52  53  35]]

Alternatively, a numpy-only solution using numpy.argpartition and numpy.argsort:

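def first_k_unique(arr, k, axis=1):
    # Weight each position whose right-hand neighbour differs (the end of a run);
    # earlier positions get more negative weights, all other positions get 0.
    n = arr.shape[1]
    val = (np.diff(arr) != 0) * np.arange(start=n - 2, stop=-1, step=-1) * -1
    # Column indices of the k smallest weights, i.e. the first k runs (in arbitrary order).
    ind = np.argpartition(val, k, axis=axis)[:, :k]
    res = np.take_along_axis(arr, indices=ind, axis=axis)
    # Sort the picked values by weight to restore their left-to-right order.
    return np.take_along_axis(res, np.take_along_axis(val, indices=ind, axis=axis).argsort(axis), axis)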


print(first_k_unique(a, 5))

Output:

[[193  64 139 180 104]
 [  1  36 156 152  37]
 [110  96  52  53  35]]

For an explanation of the core idea behind the numpy-only solution, see here.
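As a complementary illustration, a numpy-only result can also be reached by marking run starts with a boolean mask and using a stable argsort to bring them to the front. This is a different technique from the argpartition approach above, sketched under the assumption that every row contains at least k runs; the function name `first_k_run_starts` is made up.

```python
import numpy as np


def first_k_run_starts(arr, k):
    # mask is True where a new run starts (the first column always starts a run).
    mask = np.ones(arr.shape, dtype=bool)
    mask[:, 1:] = arr[:, 1:] != arr[:, :-1]
    # Stable argsort of ~mask lists run-start columns first, in left-to-right order.
    idx = np.argsort(~mask, axis=1, kind="stable")[:, :k]
    return np.take_along_axis(arr, idx, axis=1)


a = np.array([[193,  64,  64, 139, 180, 180, 104, 152,  69,  22, 192,  92],
              [  1,  36, 156, 152, 152,  37,  46, 143, 141, 114,  25, 134],
              [110,  96,  52,  53,  35, 147,   3, 116,  20,  11, 137,   5]])

print(first_k_run_starts(a, 5))
```

Unlike the groupby-based answers, this version has no Python-level loop over rows, at the cost of sorting every row in full.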

【Comments】: