Answers: index numbers for row-wise duplicate entries using numpy/pandas in Python

【Question title】: Index numbers for row-wise duplicate entries using numpy/pandas in python
【Posted】: 2017-12-08 09:56:00
【Question description】:

I have a numpy array / pandas DataFrame:

[[0 0 0 1],
 [1 0 0 1],
 [0 0 0 1],
 [1 0 0 1],
 [0 0 0 1],
 [0 0 1 0],
 [0 0 1 0]]

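I need the indices of the duplicated rows of this array, grouped by row. The result should look something like (0, 2, 4), (1, 3), (5, 6).

So far I have a workaround that runs a loop comparing each row of the unique array against each row of the actual array. It gives me a result, but not in the form I want. Here is the code I wrote; it gives me pairs, but for a large array this gets very messy.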

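import numpy as np
pair = []
unique_rows = np.unique(array, axis=0)  # distinct rows (assumed meaning of unique(array))
for i, row in enumerate(array):
    for j, row1 in enumerate(unique_rows):
        if tuple(row) == tuple(row1):
            pair.append((j, i))  # (index of unique row, index of original row)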

My result looks like this:

 [(0, 276),(1, 2931),(2, 3891),(3, 2165),(4, 1822),(5, 1241),
 (5, 2635),(5, 2644),(5, 2862),(5, 3296)]

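My array is very large, so I manually pick out the tuples that share the same first value as the indicator of duplication, and then read off the actual row numbers that are duplicates. For example, the tuples whose first value is 5 mean that row 1241 is repeated at rows 2635, 2644, 2862 and 3296.

Can anyone suggest a better way to solve this? I have looked around here but did not find anything specific.

【Question discussion】:

Tags: python-2.7 pandas numpy indexing

【Solution 1】:

With a as your array, an efficient approach is to treat each row as a block of bytes, which speeds up row comparison: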

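import numpy as np
v = np.array(a)  # assumed C-contiguous, so strides[0] is the size of one row in bytes
rows = v.view(np.dtype((np.void, v.strides[0])))  # one opaque void item per row

For example: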

In [4]: a,b=randint(0,1,(2,10000))

In [5]: %timeit tuple(a)==tuple(b)
100 loops, best of 3: 3.12 ms per loop

In [6]: %timeit str(a)==str(b)
1000 loops, best of 3: 901 µs per loop

In [7]: %timeit typ=np.void(a.strides[0]);a.view(typ)==b.view(typ)
1000 loops, best of 3: 227 µs per loop

rows is now:

array([[[0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0]],
       [[1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0]],
       [[0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0]],
       [[1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0]],
       [[0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0]],
       [[0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]],
       [[0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]]], 
      dtype='|V16')

You can then sort them and use the return_inverse argument of np.unique to locate the members of each group:

uniq,inverse=np.unique(rows,return_inverse=True)

And pretty-print the result:

In [28]: [(a[idx[0]], list(idx)) for idx in ((inverse==i).nonzero()[0] for i in range(uniq.size))]
Out[28]: [([0, 0, 0, 1], [0, 2, 4]), ([0, 0, 1, 0], [5, 6]), ([1, 0, 0, 1], [1, 3])]
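
The steps of this answer can be collected into one helper. The following is a minimal sketch, not the answerer's exact code: the function name duplicate_row_groups is hypothetical, and the argsort/bincount/split step is swapped in to avoid scanning inverse once per unique row. It assumes a 2-D array with a fixed-size numeric dtype.

```python
import numpy as np

def duplicate_row_groups(a):
    """Return a list of index arrays, one per distinct row of `a` (hypothetical helper)."""
    v = np.ascontiguousarray(a)
    # View every row as a single opaque item of raw bytes.
    row_dtype = np.dtype((np.void, v.dtype.itemsize * v.shape[1]))
    rows = v.view(row_dtype).ravel()
    # inverse[i] is the group label of row i.
    _, inverse = np.unique(rows, return_inverse=True)
    inverse = inverse.ravel()
    # A stable sort of the labels lists the row indices group by group,
    # in ascending order inside each group.
    order = np.argsort(inverse, kind="stable")
    counts = np.bincount(inverse)
    return np.split(order, np.cumsum(counts)[:-1])

a = [[0, 0, 0, 1],
     [1, 0, 0, 1],
     [0, 0, 0, 1],
     [1, 0, 0, 1],
     [0, 0, 0, 1],
     [0, 0, 1, 0],
     [0, 0, 1, 0]]
groups = duplicate_row_groups(a)
print([g.tolist() for g in groups])
```

The groups come out in the byte order of the rows, which depends on the integer size and endianness of the platform, so only the membership of each group should be relied on, not the order in which the groups are listed.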

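【Discussion】:

【Solution 2】:

I would convert each row to a string and then find the indices of every unique string in the original array.

Let's use your array: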

a = [[0, 0, 0, 1],
     [1, 0, 0, 1],
     [0, 0, 0, 1],
     [1, 0, 0, 1],
     [0, 0, 0, 1],
     [0, 0, 1, 0],
     [0, 0, 1, 0]]


import numpy as np
row_strings = np.array([str(el) for el in a])  # one string per row, built once
for unique in np.unique(row_strings):
    print(np.where(row_strings == unique)[0])

This will output:

[0 2 4]
[5 6]
[1 3]

as you wanted.
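
The loop above rescans the whole array once per unique row. A single-pass variant of the same idea might key a dictionary on each row's raw bytes instead of its string form. This is only a sketch under that assumption; the name group_rows_by_key is hypothetical and not part of the answer.

```python
from collections import defaultdict

import numpy as np

def group_rows_by_key(a):
    """Group row indices by row content in one pass (hypothetical helper)."""
    arr = np.ascontiguousarray(a)
    groups = defaultdict(list)
    for i, row in enumerate(arr):
        # The raw bytes of a row serve as a hashable key.
        groups[row.tobytes()].append(i)
    return list(groups.values())

a = [[0, 0, 0, 1],
     [1, 0, 0, 1],
     [0, 0, 0, 1],
     [1, 0, 0, 1],
     [0, 0, 0, 1],
     [0, 0, 1, 0],
     [0, 0, 1, 0]]
print(group_rows_by_key(a))
```

Each row is visited once, so the cost grows linearly with the number of rows, at the price of a Python-level loop.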

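【Discussion】:

This works well, but it is as slow as my original code. Thanks for the reply; I will try the other solutions.

【Solution 3】:

The numpy_indexed package (disclaimer: I am its author) aims to enrich numpy with this kind of functionality; using it, your problem can be written as a simple, readable one-liner: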

import numpy as np
import numpy_indexed as npi
idx_groups = npi.group_by(array).split(np.arange(len(array)))

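Note that these indices are probably not the end result you are after, but rather something needed for a subsequent computation; numpy_indexed has plenty of functionality for those common cases as well, so if you provide more context for your problem, a more complete solution may be possible.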
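
If installing an extra package is not an option, a comparable grouping can be sketched with plain NumPy through the axis argument of np.unique (available since NumPy 1.13). This is an alternative technique, not what numpy_indexed does internally; the sample array is the one from the question.

```python
import numpy as np

array = np.array([[0, 0, 0, 1],
                  [1, 0, 0, 1],
                  [0, 0, 0, 1],
                  [1, 0, 0, 1],
                  [0, 0, 0, 1],
                  [0, 0, 1, 0],
                  [0, 0, 1, 0]])

# axis=0 treats every row as one item; inverse[i] is the group label of row i.
uniq_rows, inverse = np.unique(array, axis=0, return_inverse=True)
inverse = inverse.ravel()  # keeps the labels 1-D across NumPy versions

# Collect the row indices that carry each label.
idx_groups = [np.flatnonzero(inverse == k).tolist() for k in range(len(uniq_rows))]
print(idx_groups)  # [[0, 2, 4], [5, 6], [1, 3]]
```

The groups follow the lexicographic order of the unique rows, which is why (5, 6) is listed before (1, 3).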

【Discussion】:

Thank you, I will try this package's functions and see how it helps.