Randomly select rows from a numpy array based on a condition: answers

【Question title】: Randomly select rows from numpy array based on a condition
【Posted】: 2020-11-10 09:44:24
【Question description】:

Suppose I have 2 arrays: labels is 1D and data is 5D. Note that both arrays have the same first dimension.

To keep things simple, let's assume labels contains only 3 arrays:

labels=np.array([[0,0,0,1,1,2,0,0],[0,4,0,0,0],[0,3,0,2,1,0,0,1,7,0]], dtype=object) # dtype=object is required for ragged rows in recent NumPy

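Now suppose I have a datalist of data arrays (length = 3), where each array has a 5D shape and its first dimension matches the length of the corresponding array in labels.

In this example, datalist holds 3 arrays of shapes (8,3,100,10,1), (5,3,100,10,1) and (10,3,100,10,1). Here, the first dimension of each array equals the length of the corresponding array in labels.

Now I want to reduce the number of zeros in each labels array and keep the other values. Say I want to keep only 3 zeros per array. The length of each array in labels, and the first dimension of each array in data, would then be 6, 4 and 8.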
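The lengths 6, 4 and 8 follow from counting the non-zero entries of each row and adding at most 3 zeros. A minimal sketch of that arithmetic in plain Python (the names n_zeros_to_keep and expected_lengths are illustrative, not from the question):

```python
labels = [[0, 0, 0, 1, 1, 2, 0, 0], [0, 4, 0, 0, 0], [0, 3, 0, 2, 1, 0, 0, 1, 7, 0]]
n_zeros_to_keep = 3  # illustrative name for the number of zeros kept per row

# For each row: number of non-zero values + min(zeros available, zeros to keep)
expected_lengths = [
    sum(v != 0 for v in row) + min(n_zeros_to_keep, row.count(0))
    for row in labels
]
print(expected_lengths)  # [6, 4, 8]
```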

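To reduce the number of zeros in each labels array, I want to randomly select 3 of them to keep. The same randomly selected indices will then be used to select the corresponding rows from data.

For this example, the new_labels array would look like this: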

new_labels=np.array([[0,0,1,1,2,0],[4,0,0,0],[0,3,2,1,0,1,7,0]], dtype=object) # dtype=object is required for ragged rows in recent NumPy

Here is what I have tried so far:

all_ind=[] #to store indexes where value=0 for all arrays
indexes_to_keep=[] #to store the random selected indexes
new_labels=[] #to store the final results

for i in range(len(labels)):
    ind=[] #to store indexes where value=0 for one array
    for j in range(len(labels[i])):
        if (labels[i][j]==0):
            ind.append(j)
    all_ind.append(ind)

for k in range(len(labels)):   
    indexes_to_keep.append(np.random.choice(all_ind[i], 3))
    aux= np.zeros(len(labels[i]) - len(all_ind[i]) + 3)
    ....
    .... 
    Here, how can I fill **aux** with the values ?
    ....
    .... 
    new_labels.append(aux)

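Any suggestions?

【Question comments】:

Tags: python arrays numpy

【Solution 1】:

Working with numpy arrays of different lengths is not a good idea, so you need to iterate over each item and apply some method to it. Assuming you only want to optimize that method, masking may work quite well here: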

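import numpy as np

def specific_choice(x, n):
    '''Keep every non-zero entry of x plus at most n randomly chosen zeros.'''
    x = np.array(x)
    mask = x != 0                # True where the value is non-zero
    idx = np.flatnonzero(~mask)  # positions of the zeros
    np.random.shuffle(idx)       # shuffles idx in place, quite fast
    idx = idx[:n]                # the first n shuffled zero positions
    mask[idx] = True
    return x[mask] # or mask if you need it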

Iterating over a list is faster than iterating over an array, so an efficient usage is:

labels = [[0,0,0,1,1,2,0,0],[0,4,0,0,0],[0,3,0,2,1,0,0,1,7,0]]
output = [specific_choice(n, 3) for n in labels]

Output (one possible result, since the zeros are chosen at random):

[array([0, 1, 1, 2, 0, 0]), array([0, 4, 0, 0]), array([0, 3, 0, 2, 1, 1, 7, 0])]
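If the result needs to be reproducible, or the selected positions themselves are needed, one possible variant returns indices instead of values. This is only a sketch: keep_indices is a hypothetical helper, the seed 0 is arbitrary, and it swaps np.random.shuffle for Generator.choice with replace=False.

```python
import numpy as np

def keep_indices(x, n, rng):
    """Return sorted indices keeping every non-zero entry of x and at most n random zeros."""
    x = np.asarray(x)
    zero_idx = np.flatnonzero(x == 0)
    nonzero_idx = np.flatnonzero(x != 0)
    # Sample without replacement so the same zero is never picked twice.
    chosen = rng.choice(zero_idx, size=min(n, zero_idx.size), replace=False)
    # Sorting preserves the original order of the kept entries.
    return np.sort(np.concatenate([nonzero_idx, chosen]))

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
labels = [[0, 0, 0, 1, 1, 2, 0, 0], [0, 4, 0, 0, 0], [0, 3, 0, 2, 1, 0, 0, 1, 7, 0]]

kept = [keep_indices(row, 3, rng) for row in labels]
new_labels = [np.asarray(row)[idx] for row, idx in zip(labels, kept)]
print([len(a) for a in new_labels])  # [6, 4, 8]
```

The index arrays in kept can be reused on any other array that shares the same first dimension.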

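【Comments】:

Glad to hear that. np.random.shuffle is indeed a fast choice.
How can I use these exact random indices to select the corresponding rows from data, given that the first dimension of each array in data is the same as that of the matching array in labels?
@MejdiDallel It seems you could modify the definition of the method to collect masks instead, e.g. output_of_masks = [specific_choice_masks(n, 3) for n in labels], and then follow with a comprehension like this: [data[mask] for mask in output_of_masks].
Awesome! I'll try it! Sorry, I'm new to Python x)
@MejdiDallel OK. One more thing: Python's built-in mechanics for lists and other iterables do not allow this kind of indexing. It is a feature of the numpy library, which acts as an interface that lets you work in Python while the operations run at the C level.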
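The mask-based suggestion from the comments can be sketched as follows. The name specific_choice_masks comes from the comment, but its body here is an assumption, and datalist is filled with placeholder zeros of the shapes given in the question. Since data is a list of arrays, each mask is paired with its own array via zip rather than indexing data directly.

```python
import numpy as np

def specific_choice_masks(x, n):
    """Return a boolean mask keeping every non-zero entry of x and at most n random zeros."""
    x = np.asarray(x)
    mask = x != 0
    idx = np.flatnonzero(~mask)  # positions of the zeros
    np.random.shuffle(idx)       # in-place shuffle
    mask[idx[:n]] = True         # keep the first n shuffled zeros
    return mask

labels = [[0, 0, 0, 1, 1, 2, 0, 0], [0, 4, 0, 0, 0], [0, 3, 0, 2, 1, 0, 0, 1, 7, 0]]
# Placeholder data: one 5D array per label row, first dimension = row length.
datalist = [np.zeros((len(row), 3, 100, 10, 1)) for row in labels]

output_of_masks = [specific_choice_masks(row, 3) for row in labels]
new_labels = [np.asarray(row)[m] for row, m in zip(labels, output_of_masks)]
# A 1D boolean mask selects along the first axis of the 5D array.
new_data = [d[m] for d, m in zip(datalist, output_of_masks)]
print([d.shape for d in new_data])
```

Because the same mask is applied to a label row and to its data array, the kept labels and the kept data rows stay aligned.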