Efficient method to find duplicates in arrays [closed]: answers

【Question title】: Efficient method to find duplicates in arrays [closed]
【Posted】: 2021-08-16 03:56:49
【Question description】:

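I am writing a program in Python that currently iterates row by row through a 2D numpy array and looks for identical rows in a different array. If a duplicate is found, it runs a small piece of code using the index from the first array.

This works fine and is efficient enough when the arrays are small (~2x500 and 2x500), but it quickly becomes inefficient for longer arrays. I was wondering whether anyone knows a way to do this with numpy (I am already using other numpy functionality elsewhere, so I would prefer not to change data types), or perhaps some other more efficient method. I am sure there is something faster than two for loops over the arrays. Thanks in advance.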

import random
import numpy as np
N = 1000
speed = 50
longueur = 20000          
largeur =  30000          
quadrillage = 50 
p= 0.8               
def stick():
    u = random.random()
    if u <p:
        a = 1   #The particle is stuck
    else:
        a =0    #The particle did not stick, it will instead bounce
    return a 
obstacle_number =2000   
maxstuck = 4 
numbstuck = np.zeros((obstacle_number)) 

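spacinglarg = largeur // quadrillage   # number of grid cells across the width (integer bound for randint)
spacinglong = longueur // quadrillage  # number of grid cells along the length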
obs0 = np.random.randint(0, spacinglarg,(obstacle_number,1)) *quadrillage
obs1 =  np.random.randint(0, spacinglong,(obstacle_number,1)) *quadrillage
obs = np.concatenate([obs0,obs1], axis =1)

s = (N, 2)
A = np.zeros(s)  # particle positions, one row per particle
for i in range (0,N):
    a = i*longueur/N
    b = 50
    A[i,0]= b
    A[i,1]= a


T = 50*np.round(A/(50))

B=np.zeros(s)
tp = 2*np.pi
for i in range(0, N):  # was the undefined name nombre_atomes; B has N rows
    aa = random.randint(0,360)/tp
    B[i,0]=np.cos(aa)*speed
    B[i,1]=np.sin(aa)*speed


for i in range(0, N):
    for j in range(0,len(obs)):
        if T[i,0] == obs[j,0] and T[i,1] == obs[j,1]:
            if numbstuck[j] <= maxstuck and abs(B[i,0]) != 0:    
                sss= stick()
                if sss == 1: #if it sticks
                    B[i,0]=0
                    B[i,1]=0
                    numbstuck[j] += 1
                else:
                    B[i,0]=-B[i,0] 
                    B[i,1]=-B[i,1]

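【Comments on the question】:

Please provide a minimal reproducible example. If I paste your code into my editor as-is, it raises a NameError.
Thanks, I will edit the question.
Please ping me when you do. Your program can most likely be fully vectorized, which would mean a 10-100x speedup.
@MadPhysicist I updated the code. It runs for me. I left out some parts for brevity, so the functionality may now be a bit unclear. The two for loops are actually inside a function that is called repeatedly (up to 500 times). The code at the top runs only once and initializes the simulation. Cheers, Kyle
Thanks. I will play with it and can probably make it run fast.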

Tags: python arrays performance numpy for-loop

【Solution 1】:

The general idea is to simply write the code down as a plain double loop:

import numpy as np

size = 42
a = np.arange(size**2).reshape(size, size)
b = a.copy() + size * 5

def detect_duplicates(a, b):
    duplicates = []
    for i, row_a in enumerate(a):
        for j, row_b in enumerate(b):
            if np.all(row_a == row_b):
                duplicates.append((i, j))
    return duplicates

This is slow, though:

In [1]: %timeit detect_duplicates(a, b)
7.1 ms ± 40.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
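
Part of the cost is that the nested loop compares every row of a with every row of b, so the work grows with the product of the two lengths. As an alternative that is not part of the original answer, here is a minimal sketch that indexes the rows of b in a dictionary and then looks up each row of a. The function name detect_duplicates_hashed is made up for this illustration, and the sketch assumes rows should match on exact equality.

```python
from collections import defaultdict

import numpy as np


def detect_duplicates_hashed(a, b):
    """Return (i, j) pairs where row i of a equals row j of b (hypothetical helper)."""
    # Map each distinct row of b, converted to a hashable tuple,
    # to the list of indices at which that row occurs.
    rows_of_b = defaultdict(list)
    for j, row_b in enumerate(b):
        rows_of_b[tuple(row_b.tolist())].append(j)

    duplicates = []
    for i, row_a in enumerate(a):
        # A single dictionary lookup replaces the inner loop over b.
        for j in rows_of_b.get(tuple(row_a.tolist()), ()):
            duplicates.append((i, j))
    return duplicates


size = 42
a = np.arange(size**2).reshape(size, size)
b = a.copy() + size * 5
pairs = detect_duplicates_hashed(a, b)
```

With these test arrays, row j of b holds the same values as row j + 5 of a, so the pairs come out as (5, 0), (6, 1), and so on up to (41, 36).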

With numba, though, the same loop can be sped up considerably without changing a single line of the loop body:

import numpy as np
import numba

size = 42
a = np.arange(size**2).reshape(size, size)
b = a.copy() + size * 5

@numba.njit
def detect_duplicates(a, b):
    duplicates = []
    for i, row_a in enumerate(a):
        for j, row_b in enumerate(b):
            if np.all(row_a == row_b):
                duplicates.append((i, j))
    return duplicates

Now it is much faster:

In [1]: %timeit detect_duplicates(a, b)
235 µs ± 13.8 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
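
Since the question asked for a NumPy-only approach, the same comparison can also be written with broadcasting. This is a sketch rather than a benchmarked replacement, and the name detect_duplicates_vectorized is an assumption introduced here. It assumes both arrays are 2D with the same number of columns.

```python
import numpy as np


def detect_duplicates_vectorized(a, b):
    """Return an (n, 2) array of index pairs (i, j) with a[i] equal to b[j]."""
    # a[:, None, :] has shape (len(a), 1, k) and b[None, :, :] has shape
    # (1, len(b), k); broadcasting compares every row of a with every row of b.
    equal = (a[:, None, :] == b[None, :, :]).all(axis=-1)
    # equal[i, j] is True where the two rows match in every column.
    return np.argwhere(equal)


size = 42
a = np.arange(size**2).reshape(size, size)
b = a.copy() + size * 5
pairs = detect_duplicates_vectorized(a, b)
```

The trade-off is memory: the intermediate boolean array has len(a) * len(b) * k entries. For the sizes in the question (1000 particles, 2000 obstacles, 2 columns) that is about 4 million booleans, but it grows quickly for larger inputs.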

【Comments】: