How to obtain the indices of all maximum values in array A that correspond to unique values in array B?

【Question title】: How to obtain the indices of all maximum values in array A that correspond to unique values in array B?
【Posted】: 2018-05-09 18:54:50
【Question】:

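Suppose there is an array of observation times ts, where each observation time corresponds to an observed value in vs. The observation times are given as elapsed hours (starting from zero) and can contain duplicates. I would like to find the indices that correspond to the maximum observed value for each unique observation time. I am asking for the indices rather than the values, unlike a similar question I asked a few months ago. This way, I can apply the same indices to various arrays. Below is a sample dataset, which I would like to use to adapt the code for a larger dataset.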

import numpy as np
ts = np.array([0, 0, 1, 2, 3, 3, 3, 4, 4, 5, 6, 7, 8, 8, 9, 10])
vs = np.array([500, 600, 550, 700, 500, 500, 450, 800, 900, 700, 600, 850, 850, 900, 900, 900])

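My current approach is to split the value array at every point where the observation time changes, so that each chunk holds the values for one unique time.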

condition = np.where(np.diff(ts) != 0)[0]+1
ts_spl = np.split(ts, condition)
vs_spl = np.split(vs, condition)

print(ts_spl)
>> [array([0, 0]), array([1]), array([2]), array([3, 3, 3]), array([4, 4]), array([5]), array([6]), array([7]), array([8, 8]), array([9]), array([10])]

print(vs_spl)
>> [array([500, 600]), array([550]), array([700]), array([500, 500, 450]), array([800, 900]), array([700]), array([600]), array([850]), array([850, 900]), array([900]), array([900])]

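In this case, if the maximum occurs more than once at a repeated time, every occurrence should be counted. Given this example, the returned indices would be: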

[1, 2, 3, 4, 5, 8, 9, 10, 11, 13, 14, 15]
# indices = 4,5,6 correspond to values = 500, 500, 450 ==> count indices 4,5
# I might modify this part of the algorithm to return either 4 or 5 instead of 4,5 at some future time

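Although I have not yet been able to adapt this algorithm to suit my purpose, I think it must be possible to use the size of each previously split array in vs_spl to keep an index counter. Is this approach feasible for a large dataset (10,000 elements per array before padding; 70,000 elements per array after padding)? If so, how can I adapt it? If not, what other approaches might be useful here?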
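The index-counter idea described above can be sketched as follows. This is only an illustration of that idea on the sample data, not code from the original post; the names `offset` and `local` are made up for the example.

```python
import numpy as np

ts = np.array([0, 0, 1, 2, 3, 3, 3, 4, 4, 5, 6, 7, 8, 8, 9, 10])
vs = np.array([500, 600, 550, 700, 500, 500, 450, 800, 900, 700, 600, 850, 850, 900, 900, 900])

# Split vs wherever the observation time changes.
condition = np.where(np.diff(ts) != 0)[0] + 1
vs_spl = np.split(vs, condition)

indices = []
offset = 0  # position in the original array of the current chunk's first element
for chunk in vs_spl:
    # Positions inside the chunk that hold the chunk maximum (ties included).
    local = np.flatnonzero(chunk == chunk.max())
    indices.extend((local + offset).tolist())
    offset += len(chunk)

print(indices)
# [1, 2, 3, 4, 5, 8, 9, 10, 11, 13, 14, 15]
```

The Python-level loop runs once per unique time, so it should be workable at the sizes mentioned, though a fully vectorized version is likely to be faster.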

【Comments on the question】:

Tags: python-3.x numpy duplicates max unique

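【Solution 1】:

70,000 is not that large, so yes, it should be feasible. It is, however, faster to avoid the split and use the .reduceat method of the relevant ufunc. reduceat is like reduce applied to chunks, except that you do not have to provide the chunks; you only tell reduceat where you would have cut to get them. For example, like so: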

import numpy as np


N = 10**6
ts = np.cumsum(np.random.rand(N) < 0.1)
vs = 50*np.random.randint(10, 20, (N,))

#ts = np.array([0, 0, 1, 2, 3, 3, 3, 4, 4, 5, 6, 7, 8, 8, 9, 10])
#vs = np.array([500, 600, 550, 700, 500, 500, 450, 800, 900, 700, 600, 850, 850, 900, 900, 900])


# flatnonzero is a bit faster than where
condition = np.r_[0, np.flatnonzero(np.diff(ts)) + 1, len(ts)]
sizes = np.diff(condition)
maxima = np.repeat(np.maximum.reduceat(vs, condition[:-1]), sizes)
maxat = maxima == vs
indices = np.flatnonzero(maxat)
# if you want to know how many maxima at each hour
nmax = np.add.reduceat(maxat, condition[:-1])
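As a sanity check, the same steps can be run on the small sample from the question. The sketch below is not part of the original answer. The names `starts`, `block_max`, `block_id` and `first_only` are introduced only for this illustration, and the last step (keeping just the first maximum per hour) is one possible way to handle the tie-breaking variant the question mentions.

```python
import numpy as np

ts = np.array([0, 0, 1, 2, 3, 3, 3, 4, 4, 5, 6, 7, 8, 8, 9, 10])
vs = np.array([500, 600, 550, 700, 500, 500, 450, 800, 900, 700, 600, 850, 850, 900, 900, 900])

# First index of each block of equal times, and the block lengths.
starts = np.r_[0, np.flatnonzero(np.diff(ts)) + 1]
sizes = np.diff(np.r_[starts, len(ts)])

# One maximum per block, computed without splitting the array.
block_max = np.maximum.reduceat(vs, starts)

# Every position whose value equals its block's maximum (ties included).
indices = np.flatnonzero(np.repeat(block_max, sizes) == vs)

# Variant: keep only the first maximum in each block.
block_id = np.repeat(np.arange(len(starts)), sizes)
first_only = indices[np.unique(block_id[indices], return_index=True)[1]]

print(block_max.tolist())   # [600, 550, 700, 500, 900, 700, 600, 850, 900, 900, 900]
print(indices.tolist())     # [1, 2, 3, 4, 5, 8, 9, 10, 11, 13, 14, 15]
print(first_only.tolist())  # [1, 2, 3, 4, 8, 9, 10, 11, 13, 14, 15]
```

The `indices` result matches the list expected in the question, including both tied maxima at indices 4 and 5.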

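【Comments】:

On mobile at the moment. I can test and play with this in about an hour. Thanks!
I think I follow everything except condition = np.r_[0, np.flatnonzero(np.diff(ts)) + 1, len(ts)]. From what I understand, np.flatnonzero returns the indices of the non-zero values in chronological order, which you check against consecutive observation times. Your tip about .reduceat was helpful. From the docs, I see that np.r_ can build arrays, but can you explain how it is used in this line?
flatnonzero does the same thing as where(...)[0] in your code. r_ applied to vectors and scalars simply concatenates them, so in this case we add a zero on the left and the length on the right. This way we have not only the inner boundaries but also the outer ones. That is useful, for example, when we want to compute the sizes of the chunks, as we do in the next line.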
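To make the boundary construction concrete, here is a small sketch on the sample data from the question. The variable names `inner` and `same` are illustrative only and do not appear in the answer.

```python
import numpy as np

ts = np.array([0, 0, 1, 2, 3, 3, 3, 4, 4, 5, 6, 7, 8, 8, 9, 10])

# Inner boundaries: positions where a new observation time starts.
inner = np.flatnonzero(np.diff(ts)) + 1

# For a 1-D input, flatnonzero(x) gives the same result as where(x != 0)[0].
assert np.array_equal(inner, np.where(np.diff(ts) != 0)[0] + 1)

# np.r_ concatenates scalars and arrays: prepend 0, append len(ts).
condition = np.r_[0, inner, len(ts)]
same = np.concatenate(([0], inner, [len(ts)]))  # equivalent explicit form

# With both outer boundaries present, the chunk sizes are just the differences.
sizes = np.diff(condition)

print(inner.tolist())      # [2, 3, 4, 7, 9, 10, 11, 12, 14, 15]
print(condition.tolist())  # [0, 2, 3, 4, 7, 9, 10, 11, 12, 14, 15, 16]
print(sizes.tolist())      # [2, 1, 1, 3, 2, 1, 1, 1, 2, 1, 1]
```

Note that reduceat only needs the chunk start positions, which is why the answer passes condition[:-1] rather than the full boundary array.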