Compare large sets of arrays: answers

【Question title】: compare large sets of arrays
【Posted】: 2018-09-18 05:00:42
【Question description】:

I have a numpy array A made up of n 1x3 arrays, where n is the total number of possible combinations of elements in a 1x3 array whose elements each range from 0 to 50. That is,

 A = [[0,0,0],[0,0,1]...[0,1,0]...[50,50,50]]

and

 len(A) = 50*50*50 = 125000

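I have a second numpy array B made up of m 1x3 arrays, where m = 10 million, and each of these arrays takes its values from the set described by A.

I want to count how many times each combination occurs in B, i.e. how many times [0,0,0] occurs in B, how many times [0,0,1] occurs, ..., how many times [50,50,50] occurs. So far I have the following: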

y = np.zeros(len(A), dtype=int)  # y[i] counts occurrences of A[i] in B
for i in range(len(A)):
    for j in range(len(B)):
        if np.array_equal(A[i], B[j]):
            y[i] += 1

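where y keeps track of how many times the i-th array occurs. So y[0] is the number of times [0,0,0] occurs in B, y[1] is the number of times [0,0,1] occurs, ..., y[125000] is the number of times [50,50,50] occurs, and so on.

The problem is that this takes a very long time: it has to check 10 million entries, 125000 times over. Is there a faster, more efficient way to do this?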

【Question comments】:

A small nitpick: 0, ..., 50 is 51 numbers, so len(A) would be 51^3.

Tags: python arrays numpy combinations

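【Solution 1】:

Here is a fast approach. It processes 10 million tuples drawn from range(50)^3 in a fraction of a second, and is roughly 100 times faster than the next best solution (@Primusa's):

It relies on the fact that there is a direct translation between such tuples and the numbers 0 to 50^3 - 1. (The mapping happens to be the same as the one between the rows of A and their row numbers.) The functions np.ravel_multi_index and np.unravel_index implement this translation and its inverse.

Once B has been translated to numbers, np.bincount can determine their frequencies very efficiently. Below I reshape the result to get a 50x50x50 histogram, but that is a matter of taste and can be left out. (I took the liberty of using only the numbers 0 to 49, so that len(A) becomes 125000):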
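To make the mapping concrete, here is a minimal sketch on a handful of hand-picked triples, assuming three coordinates that each lie in range(50). The names dims, triples, flat, by_hand and restored are only illustrative and do not come from the answer.

```python
import numpy as np

# Assumed for illustration: three coordinates, each in range(50).
dims = (50, 50, 50)

# A few sample triples, one per row.
triples = np.array([[0, 0, 0], [0, 0, 1], [0, 1, 0], [49, 49, 49]])

# ravel_multi_index expects one array per coordinate, hence the transpose.
flat = np.ravel_multi_index(triples.T, dims)

# The same mapping written out by hand: i * 50 * 50 + j * 50 + k.
by_hand = triples[:, 0] * 2500 + triples[:, 1] * 50 + triples[:, 2]

# unravel_index is the inverse; stacking its output restores the rows.
restored = np.column_stack(np.unravel_index(flat, dims))

print(flat.tolist())                      # [0, 1, 50, 124999]
print(np.array_equal(flat, by_hand))      # True
print(np.array_equal(restored, triples))  # True
```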

>>> import numpy as np
>>> B = np.random.randint(0, 50, (10000000, 3))
>>> Br = np.ravel_multi_index(B.T, (50, 50, 50))
>>> result = np.bincount(Br, minlength=125000).reshape(50, 50, 50)

Let's look at a smaller example for demonstration:

>>> B = np.random.randint(0, 3, (10, 3))
>>> Br = np.ravel_multi_index(B.T, (3, 3, 3))
>>> result = np.bincount(Br, minlength=27).reshape(3, 3, 3)
>>> 
>>> B
array([[1, 1, 2],
       [2, 1, 2],
       [2, 0, 0],
       [2, 1, 0],
       [2, 0, 2],
       [0, 0, 2],
       [0, 0, 2],
       [0, 2, 2],
       [2, 0, 0],
       [0, 2, 0]])
>>> result
array([[[0, 0, 2],
        [0, 0, 0],
        [1, 0, 1]],

       [[0, 0, 0],
        [0, 0, 1],
        [0, 0, 0]],

       [[2, 0, 1],
        [1, 0, 1],
        [0, 0, 0]]])

For example, to look up how many times [2,1,0] occurs in B:

>>> result[2,1,0]
1

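As mentioned above: to convert between indices into your A and the actual rows of A (which are the indices into my result), you can use np.ravel_multi_index and np.unravel_index. Alternatively, you can leave out the final reshape (i.e. use result = np.bincount(Br, minlength=125000)); the counts are then indexed exactly like A.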
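As a rough sketch of that last variant, the snippet below keeps the counts flat so that y[i] lines up with A[i]. It assumes the full range 0..50 (51 values per coordinate, as the comment on the question points out) and an A stored in lexicographic order. The random B is smaller than the real one and only a stand-in. The names n_values, dims, rng and y are illustrative.

```python
import numpy as np

n_values = 51  # 0..50 inclusive, i.e. 51 values per coordinate
dims = (n_values,) * 3

# A in lexicographic order: [0,0,0], [0,0,1], ..., [50,50,50].
A = np.indices(dims).reshape(3, -1).T

# Hypothetical sample data standing in for the real B (smaller for a quick run).
rng = np.random.default_rng(0)
B = rng.integers(0, n_values, size=(100_000, 3))

# y[i] is the number of times A[i] occurs in B.
y = np.bincount(np.ravel_multi_index(B.T, dims), minlength=n_values ** 3)

# Spot check one entry against a direct row comparison.
i = 12345
assert y[i] == np.count_nonzero((B == A[i]).all(axis=1))

print(len(A), len(y))  # 132651 132651
```

If A is stored in some other order, the flat counts would have to be reindexed, for instance with np.ravel_multi_index(A.T, dims).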

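【Comments】:

【Solution 2】:

You can use a dict() to speed this up, so that it only has to go through the 10 million entries once.

The first thing to do is turn all the sublists in A into hashable objects, so that you can use them as keys in a dictionary.

Convert all the sublists to tuples: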

A = [tuple(i) for i in A]

Then create a dict() with every value in A as a key and 0 as its value.

d = {i:0 for i in A}

Now, for each subarray in your numpy array, simply convert it to a tuple and increment d[that array] by 1:

for subarray in B:
    d[tuple(subarray)] += 1

d is now a dictionary in which the value of each key is the number of times that key occurs in B.
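Put together, the steps above might look like the following sketch. The data are small stand-ins (values 0..2 rather than 0..50), and the use of itertools.product and tolist() is a choice made for this illustration, not something taken from the answer.

```python
import itertools
import numpy as np

# Small stand-ins for the real data (values 0..2 instead of 0..50).
A = np.array(list(itertools.product(range(3), repeat=3)))
B = np.array([[1, 1, 2], [2, 0, 0], [0, 0, 2], [2, 0, 0], [0, 0, 2]])

# One dict entry per row of A, all starting at zero.
d = {tuple(row): 0 for row in A.tolist()}

# A single pass over B; tolist() yields plain Python ints, which hash
# the same as the keys built above.
for row in B.tolist():
    d[tuple(row)] += 1

# Counts in the same order as the rows of A.
y = [d[tuple(row)] for row in A.tolist()]

print(d[(2, 0, 0)])  # 2
print(sum(y))        # 5
```

The loop still runs in Python, once per row of B, which is consistent with Solution 1 reporting it as much slower than the bincount approach.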

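【Comments】:

【Solution 3】:

You can find the unique rows of B and their counts by calling np.unique on B along its first axis with return_counts=True. You can then use broadcasting, calling the ndarray.all and ndarray.any methods on the right axes, to find the indices of those unique rows of B that also occur in A. After that, all you need is a simple indexing: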

In [82]: unique, counts = np.unique(B, axis=0, return_counts=True)

In [83]: indices = np.where((unique == A[:,None,:]).all(axis=2).any(axis=0))[0]

# Get items from A that exist in B
In [84]: unique[indices]

# Get the counts 
In [85]: counts[indices]

Example:

In [86]: arr = np.array([[2 ,3, 4], [5, 6, 0], [2, 3, 4], [1, 0, 4], [3, 3, 3], [5, 6, 0], [2, 3, 4]])

In [87]: a = np.array([[2, 3, 4], [1, 9, 5], [3, 3, 3]])

In [88]: unique, counts = np.unique(arr, axis=0, return_counts=True)

In [89]: indices = np.where((unique == a[:,None,:]).all(axis=2).any(axis=0))[0]

In [90]: unique[indices]
Out[90]: 
array([[2, 3, 4],
       [3, 3, 3]])

In [91]: counts[indices]
Out[91]: array([3, 1])
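One thing to keep in mind is that the broadcast comparison builds a boolean array of shape (len(A), len(unique), 3). If B contained all 125000 combinations, that would be 125000 x 125000 x 3 booleans, roughly 47 GB. The sketch below is a variation rather than this answer's own method: it keeps np.unique but places the counts with np.ravel_multi_index from Solution 1, assuming A is in lexicographic order. The value range 0..6 in dims is an assumption chosen to fit the example data.

```python
import numpy as np

dims = (7, 7, 7)  # assumed value range 0..6, chosen to fit the example data

arr = np.array([[2, 3, 4], [5, 6, 0], [2, 3, 4], [1, 0, 4],
                [3, 3, 3], [5, 6, 0], [2, 3, 4]])

unique, counts = np.unique(arr, axis=0, return_counts=True)

# Row number each unique row would have in a lexicographically ordered A.
positions = np.ravel_multi_index(unique.T, dims)

# Scatter the counts; combinations that never occur stay at zero.
y = np.zeros(np.prod(dims), dtype=counts.dtype)
y[positions] = counts

print(y[np.ravel_multi_index((2, 3, 4), dims)])  # 3
```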

【Comments】:

【Solution 4】:

You can do it like this:

y=[np.where(np.all(B==arr,axis=1))[0].shape[0] for arr in A]

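arr simply iterates over A; np.all checks where it matches a row of B; np.where returns the positions of those matches as an array; and shape then gives the length of that array, in other words the desired frequency.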
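A small runnable illustration of the list comprehension above, using made-up arrays. The y_alt line is an added variant that counts the matches directly with np.count_nonzero instead of measuring the index array returned by np.where.

```python
import numpy as np

# Made-up data for illustration.
A = np.array([[2, 3, 4], [1, 9, 5], [3, 3, 3]])
B = np.array([[2, 3, 4], [5, 6, 0], [2, 3, 4], [1, 0, 4],
              [3, 3, 3], [5, 6, 0], [2, 3, 4]])

# The list comprehension from above.
y = [np.where(np.all(B == arr, axis=1))[0].shape[0] for arr in A]

# Variant: count the True entries directly instead of building an index array.
y_alt = [int(np.count_nonzero(np.all(B == arr, axis=1))) for arr in A]

print(y)      # [3, 0, 1]
print(y_alt)  # [3, 0, 1]
```

Each iteration still scans all of B, so the total work remains proportional to len(A) * len(B). The comparison itself is vectorized, but the outer loop over A is not.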

【Comments】: