Fast Hamming distance computation between binary numpy arrays: answers

【Question title】: Fast hamming distance computation between binary numpy arrays
【Posted】: 2015-12-20 05:13:58
【Question】:

I have two numpy arrays of the same length that contain binary values

import numpy as np
a=np.array([1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0])
b=np.array([1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1])

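I want to compute the Hamming distance between them as fast as possible, since I have millions of such distance computations to make.

A simple but slow option is this (taken from Wikipedia):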

%timeit sum(ch1 != ch2 for ch1, ch2 in zip(a, b))
10000 loops, best of 3: 79 us per loop

Inspired by some answers here on Stack Overflow, I came up with faster options.

%timeit np.sum(np.bitwise_xor(a,b))
100000 loops, best of 3: 6.94 us per loop

%timeit len(np.bitwise_xor(a,b).nonzero()[0])
100000 loops, best of 3: 2.43 us per loop

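I am wondering whether there are even faster ways to compute this, possibly using cython?

【Question comments】:

Are the example arrays a and b the same length as your real data?
Are you computing all pairwise distances within an array of arrays, or between two arrays of arrays? You might be able to use scipy.spatial.distance.cdist or scipy.spatial.distance.pdist.
@WarrenWeckesser They are of the same order, yes. Depending on some parameter settings, their length will be between 20 and 100.
scipy/spatial/distance.py hamming(u, v): ... return (u != v).mean(). See also bitarray.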
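
A note on the scipy function quoted in the last comment: because it returns (u != v).mean(), the result is the fraction of differing positions, not the count. A minimal numpy-only sketch of converting between the two, using the arrays from the question (the names fraction and count are just for illustration):

```python
import numpy as np

a = np.array([1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0])
b = np.array([1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1])

# Fraction of positions that differ, as in the quoted scipy source
fraction = (a != b).mean()

# Multiply by the length to recover the integer Hamming distance
count = int(round(fraction * a.size))
print(fraction, count)
```

So if an integer distance is needed, the normalized value has to be scaled back by the array length.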

Tags: python arrays numpy cython hamming-distance

【Solution 1】:

There is a ready-made numpy function that beats len((a != b).nonzero()[0]) ;)

np.count_nonzero(a!=b)
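
Since the question mentions millions of comparisons, batching them may matter more than the per-call cost. A minimal sketch, assuming the vectors can be stacked into 2D arrays (A, B, and rng are made-up names, and the random data is only a placeholder); np.count_nonzero accepts an axis argument in NumPy 1.12 and later:

```python
import numpy as np

a = np.array([1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0])
b = np.array([1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1])

# Single pair, as in the answer above
single = int(np.count_nonzero(a != b))
print(single)  # 7

# Hypothetical batch: 1000 pairs of length-21 binary vectors, one pair per row
rng = np.random.default_rng(0)
A = rng.integers(0, 2, size=(1000, 21))
B = rng.integers(0, 2, size=(1000, 21))

# One Hamming distance per row, computed in a single vectorized call
dists = np.count_nonzero(A != B, axis=1)
print(dists.shape)
```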

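【Comments】:

【Solution 2】:

Compared to 1.07µs for np.count_nonzero(a!=b) on my platform, gmpy2.hamdist gets it down to about 143ns once each array has been converted to an mpz (multiple-precision integer):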

import numpy as np
from gmpy2 import mpz, hamdist, pack

a = np.array([1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0])
b = np.array([1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1])

With a tip from @casevh, the conversion from a 1D array of ones and zeros to a gmpy2 mpz object can be done reasonably efficiently with gmpy2.pack(list(reversed(list(array))),1).

# gmpy2.pack reverses bit order but that does not affect
# hamdist since both its arguments are reversed
ampz = pack(list(a),1) # takes about 4.29µs
bmpz = pack(list(b),1)

hamdist(ampz,bmpz)
Out[8]: 7

%timeit hamdist(ampz,bmpz)
10000000 loops, best of 3: 143 ns per loop

For relative comparison, on my platform:

%timeit np.count_nonzero(a!=b)
1000000 loops, best of 3: 1.07 µs per loop

%timeit len((a != b).nonzero()[0])
1000000 loops, best of 3: 1.55 µs per loop

%timeit len(np.bitwise_xor(a,b).nonzero()[0])
1000000 loops, best of 3: 1.7 µs per loop

%timeit np.sum(np.bitwise_xor(a,b))
100000 loops, best of 3: 5.8 µs per loop

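【Comments】:

To be fair, you should probably include the time it takes to convert the input arrays to the mpz format.
You can use gmpy2.pack(list(a),1) to convert the numpy array to an mpz. It is faster than convert2mpz(). If the conversion time is included, it will still be slower than the numpy solutions.
If you want to construct the same mpz as the original code, you do need to use reversed(). However, the Hamming distance does not depend on the order of the bits (i.e. high-to-low versus low-to-high). As long as the two arrays have the same length, so that the same bit positions are compared with each other, the Hamming distance will be the same.
Does anyone know whether something has changed since this was posted? When I try to use pack after copy-pasting the imports and the definitions of a & b from this post, I get an error: TypeError: pack() requires list elements be positive integers
For me, changing list(a) to a.tolist() did the trick.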
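
To illustrate the point about bit order without depending on gmpy2, here is a minimal sketch that packs the bits into plain Python integers instead. This is a stand-in, not the gmpy2 API: to_int and popcount_dist are hypothetical helper names, and a.tolist() is used so that the elements are plain Python ints, as suggested in the last comment.

```python
import numpy as np

a = np.array([1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0])
b = np.array([1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1])

def to_int(bits):
    """Pack a list of 0/1 Python ints into one int (first element = most significant bit)."""
    return int("".join(str(x) for x in bits), 2)

def popcount_dist(x, y):
    """Hamming distance of two ints: number of set bits in their XOR."""
    return bin(x ^ y).count("1")

# Same bit order on both sides
forward = popcount_dist(to_int(a.tolist()), to_int(b.tolist()))

# Both sides reversed: the same positions are still compared with each other
backward = popcount_dist(to_int(a.tolist()[::-1]), to_int(b.tolist()[::-1]))

print(forward, backward)  # 7 7
```

Reversing only one of the two arrays would compare different positions and generally give a different result.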

【Solution 3】:

Using pythran can bring an extra benefit here:

$ cat hamm.py
#pythran export hamm(int[], int[])
from numpy import nonzero
def hamm(a,b):
    return len(nonzero(a != b)[0])

For reference (without pythran):

$ python -m timeit -s 'import numpy as np; a = np.random.randint(0,2, 100); b = np.random.randint(0,2, 100); from hamm import hamm' 'hamm(a,b)'
100000 loops, best of 3: 4.66 usec per loop

After pythran compilation:

$ python -m pythran.run hamm.py
$ python -m timeit -s 'import numpy as np; a = np.random.randint(0,2, 100); b = np.random.randint(0,2, 100); from hamm import hamm' 'hamm(a,b)'
1000000 loops, best of 3: 0.745 usec per loop

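That is roughly a 6x speedup over the numpy implementation, because pythran skips the creation of an intermediate array when evaluating the element-wise comparison.

I also measured: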

from numpy import count_nonzero
def hamm(a,b):
    return count_nonzero(a != b)

I got 3.11 usec per loop for the Python version and 0.427 usec per loop for the Pythran one.

Disclaimer: I'm one of the Pythran developers.

【Comments】:

【Solution 4】:

For strings, this works faster:

def Hamm(a, b):
    # len() works for strings as well as 1-D arrays; a.shape exists only on arrays
    c = 0
    for i in range(len(a)):
        if a[i] != b[i]:
            c += 1
    return c

【Comments】:

【Solution 5】:

I suggest converting the numpy bit array into a numpy uint8 array with np.packbits.

Have a look at scipy's spatial.distance.hamming: https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.hamming.html

Otherwise, here is a small snippet that needs only numpy, inspired by "Fast way of counting non-zero bits in positive integer":

bit_counts = np.array([int(bin(x).count("1")) for x in range(256)]).astype(np.uint8)
def hamming_dist(a,b,axis=None):
    return np.sum(bit_counts[np.bitwise_xor(a,b)],axis=axis)

With axis=-1, this gives the Hamming distances between a single entry and every row of a large array; for example:

inp = np.uint8(np.random.random((512,8))*255) #512 entries of 8 byte
hd = hamming_dist(inp, inp[123], axis=-1) #results in 512 hamming distances to entry 123
idx_best = np.argmin(hd)    # should point to identity 123
hd[123] = 255 #mask out identity
idx_nearest= np.argmin(hd)    # should point entry in list with shortest distance to entry 123
dist_hist = np.bincount(np.uint8(hd)) # distribution of hamming distances; for me this started at 18bits to 44bits with a maximum at 31
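
To connect the two suggestions above, here is a minimal sketch that first packs the question's 0/1 arrays with np.packbits and then applies the same lookup-table idea. The table is rebuilt here so that the block is self-contained; a_packed and b_packed are illustrative names. np.packbits pads the last byte with zeros, and that padding cancels out in the XOR as long as both arrays have the same length.

```python
import numpy as np

# Number of set bits for every possible byte value
bit_counts = np.array([bin(x).count("1") for x in range(256)], dtype=np.uint8)

a = np.array([1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0])
b = np.array([1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1])

# 21 bits are packed into 3 bytes each (the last byte is zero-padded)
a_packed = np.packbits(a.astype(np.uint8))
b_packed = np.packbits(b.astype(np.uint8))

# XOR marks differing bits; the table turns each byte into its bit count
dist = int(bit_counts[np.bitwise_xor(a_packed, b_packed)].sum())
print(dist)  # 7
```

The packing step has a cost of its own, so this mainly pays off when the arrays are packed once and then compared many times.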

【Comments】: