Remove rows by duplicate column(s) values: answers

【Question title】: Remove rows by duplicate column(s) values
【Posted】: 2018-11-27 17:38:37
【Question】:

I have a large dataset in a numpy.ndarray, similar to this:

array([[ -4,   5,   9,  30,  50,  80],
       [  2,  -6,   9,  34,  12,   7],
       [ -4,   5,   9,  98, -21,  80],
       [  5,  -9,   0,  32,  18,   0]])

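I want to remove duplicate rows, where rows count as duplicates if their columns 0, 1, 2 and 5 are equal. I.e., for the matrix above, the result would be: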

-4, 5, 9, 30, 50, 80
2, -6, 9, 34, 12, 7
5, -9, 0, 32, 18, 0

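numpy.unique does something very similar, but it only finds duplicates across all columns (along an axis). I only want specific columns to be compared. How would one solve this with numpy? I couldn't find any decent numpy algorithm for this. Is there a better module?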

【Comments】:

Tags: python python-3.x numpy matrix multidimensional-array

【Solution 1】:

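Use np.unique on the sliced array with return_index=True and axis=0. This treats each row of the slice as a single entity and gives us the index of the first occurrence of each unique row. Those indices can then be used to row-index the original array and get the desired output.

So, with a as the input array, that would be: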

a[np.unique(a[:,[0,1,2,5]],return_index=True,axis=0)[1]]

Here is a sample run that breaks this down into steps and hopefully makes things clearer:

In [29]: a
Out[29]: 
array([[ -4,   5,   9,  30,  50,  80],
       [  2,  -6,   9,  34,  12,   7],
       [ -4,   5,   9,  98, -21,  80],
       [  5,  -9,   0,  32,  18,   0]])

In [30]: a_slice = a[:,[0,1,2,5]]

In [31]: _, unq_row_indices = np.unique(a_slice,return_index=True,axis=0)

In [32]: final_output = a[unq_row_indices]

In [33]: final_output
Out[33]: 
array([[-4,  5,  9, 30, 50, 80],
       [ 2, -6,  9, 34, 12,  7],
       [ 5, -9,  0, 32, 18,  0]])
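
One detail worth noting: np.unique returns the unique rows in sorted order, so the indices from return_index are not guaranteed to be ascending, and the output rows may come back in a different order than in the input. If the original row order matters, sorting the indices before indexing is one way to keep it. The sketch below assumes that requirement; the helper name drop_duplicate_rows is made up for illustration and is not part of NumPy.

```python
import numpy as np

def drop_duplicate_rows(a, cols):
    """Keep the first row for each distinct combination of values in `cols`.

    Hypothetical helper (not a NumPy function). Rows are returned in their
    original order because the first-occurrence indices are sorted.
    """
    _, first_idx = np.unique(a[:, cols], axis=0, return_index=True)
    return a[np.sort(first_idx)]

a = np.array([[-4,  5, 9, 30,  50, 80],
              [ 2, -6, 9, 34,  12,  7],
              [-4,  5, 9, 98, -21, 80],
              [ 5, -9, 0, 32,  18,  0]])

deduped = drop_duplicate_rows(a, [0, 1, 2, 5])

# Same data with the rows reversed: the first three rows are the first
# occurrences, and they should come back in that same order.
b = a[::-1]
deduped_reversed = drop_duplicate_rows(b, [0, 1, 2, 5])
```

For the array from the question this keeps rows 0, 1 and 3; for the reversed copy it keeps the first three rows of b in their existing order.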

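【Comments】:

@Divakar Thanks a lot for this. There are a few things I don't understand. return_index should return another array at the end, right? How come it isn't returned here? Does it have something to do with the [1] at the end? I don't understand what that is doing there.
Thanks. I didn't know that `_,` took the first output so the index array lands in the last variable. What is that comma called? I'd like to search for more about it.
@user7331538 _ is just a placeholder, used to skip storing an output in a named variable. That's also why np.unique(a_slice,return_index=True,axis=0)[1] works: it gets the second output directly.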
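
To illustrate the point about `_` and `[1]`: with return_index=True, np.unique returns a tuple of two arrays, and the comma on the left-hand side is ordinary Python tuple unpacking. The small array below is made up for illustration only.

```python
import numpy as np

# Made-up example data: rows 0 and 1 are identical.
a_slice = np.array([[1, 2],
                    [1, 2],
                    [3, 4]])

# With return_index=True the result is a tuple: (unique_rows, indices).
result = np.unique(a_slice, return_index=True, axis=0)

# Tuple unpacking: each name on the left receives one element of the tuple.
unique_rows, indices = result

# Same thing, but the first element is discarded by naming it `_`.
_, indices_via_placeholder = result

# Same thing again, picking the second element by position.
indices_via_position = result[1]
```

All three index variables hold the same array, so which form to use is a matter of taste.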

【Solution 2】:

Pandas provides this functionality via pd.DataFrame.drop_duplicates. However, the convenient syntax comes at the cost of performance.

import pandas as pd
import numpy as np

A = np.array([[ -4,   5,   9,  30,  50,  80],
              [  2,  -6,   9,  34,  12,   7],
              [ -4,   5,   9,  98, -21,  80],
              [  5,  -9,   0,  32,  18,   0]])

res = pd.DataFrame(A)\
        .drop_duplicates(subset=[0, 1, 2, 5])\
        .values

print(res)

[[-4  5  9 30 50 80]
 [ 2 -6  9 34 12  7]
 [ 5 -9  0 32 18  0]]
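
As a small addition, drop_duplicates also takes a keep argument that controls which of the duplicate rows survives. The sketch below reuses the array from the question; the variable names are only illustrative, and .to_numpy() is used in place of .values on the assumption that a reasonably recent pandas version is installed.

```python
import numpy as np
import pandas as pd

A = np.array([[-4,  5, 9, 30,  50, 80],
              [ 2, -6, 9, 34,  12,  7],
              [-4,  5, 9, 98, -21, 80],
              [ 5, -9, 0, 32,  18,  0]])

df = pd.DataFrame(A)      # columns are labelled 0..5 by default
key_cols = [0, 1, 2, 5]   # columns that define a duplicate

# keep="first" (the default) retains row 0 and drops row 2.
keep_first = df.drop_duplicates(subset=key_cols, keep="first").to_numpy()

# keep="last" retains row 2 and drops row 0 instead.
keep_last = df.drop_duplicates(subset=key_cols, keep="last").to_numpy()
```

In both cases the surviving rows stay in their original order.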

【Comments】:

【Solution 3】:

You can use np.take (https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.take.html) to pull out only the columns you care about from the array, and then call np.unique on them with return_index=True.

>>> arr = np.array([[ -4,   5,   9,  30,  50,  80],
...        [  2,  -6,   9,  34,  12,   7],
...        [ -4,   5,   9,  98, -21,  80],
...        [  5,  -9,   0,  32,  18,   0]])
>>> relevant_columns = np.take(arr, [0,1,2,5], axis=1)
>>> np.unique(relevant_columns, axis=0, return_index=True)
(array([[ 2, -6,  9,  7],
       [ 5, -9,  0,  0],
       [-4,  5,  9, 80]]), array([1, 3, 0]))

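You can then use np.take() again on the original numpy array, with axis=0, passing the returned index array (array([1, 3, 0]) in the run above) as the indices argument.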
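
The final step described above could look like the following minimal sketch. It takes the index array straight from np.unique rather than hard-coding it, since the order in which np.unique lists the rows may differ between NumPy versions; the variable names first_idx and result are illustrative.

```python
import numpy as np

arr = np.array([[-4,  5, 9, 30,  50, 80],
                [ 2, -6, 9, 34,  12,  7],
                [-4,  5, 9, 98, -21, 80],
                [ 5, -9, 0, 32,  18,  0]])

# Step 1: extract only the columns that define a duplicate.
relevant_columns = np.take(arr, [0, 1, 2, 5], axis=1)

# Step 2: index of the first occurrence of each unique row.
_, first_idx = np.unique(relevant_columns, axis=0, return_index=True)

# Step 3: select those rows from the original array.
result = np.take(arr, first_idx, axis=0)
```

Here result contains rows 0, 1 and 3 of arr, in whatever order np.unique produced the indices.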

【Comments】: