Large array searching with numpy

【Question title】: Large array searching with numpy
【Posted】: 2014-08-24 17:37:29
【Question】:

I have two arrays of integers

a = numpy.array([1109830922873, 2838383, 839839393, ..., 29839933982])
b = numpy.array([2838383, 555555555, 2839474582, ..., 29839933982])

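len(a) ~ 15,000 and len(b) ~ 2 million.

What I want is to find the indices of the elements of array b that match the elements of array a. Right now, I am using a list comprehension and numpy.argwhere() to achieve this: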

bInds = [ numpy.argwhere(b == c)[0] for c in a ]

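But, obviously, this takes a long time to complete. Array a will also grow, so this is not a sensible route to take.

Is there a better way to achieve this result, considering the large arrays I am dealing with here? It currently takes around 5 minutes. Any speed-up is welcome!

More info: I want the indices to match the order of array a as well. (Thanks Charles)

【Comments】:

Maybe you could create a hashmap mapping elements from a to their respective indices. Then you only need to look them up in the map.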

Tags: python arrays search numpy indices

【Solution 1】:

This takes about a second to run.

import numpy

#make some fake data...
a = (numpy.random.random(15000) * 2**16).astype(int)
b = (numpy.random.random(2000000) * 2**16).astype(int)

#find indices of b whose values are contained in a.
set_a = set(a)
result = set()
for i,val in enumerate(b):
    if val in set_a:
        result.add(i)

result = numpy.array(list(result))
result.sort()

print(result)
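
For comparison, the same membership test can be written without an explicit Python loop. This is only a minimal sketch and not part of the original answer: it assumes a NumPy version that provides numpy.isin (1.13 or later), and the small arrays are made up for illustration. Like the loop above, it returns the indices in the order of b, not of a.

```python
import numpy

# Small stand-in arrays; the real a and b from the question are much larger.
a = numpy.array([7, 3, 9])
b = numpy.array([3, 5, 7, 3, 1])

# Boolean mask over b: True where the element of b also occurs in a.
mask = numpy.isin(b, a)

# Positions in b that hold a value from a, in ascending order.
result = numpy.flatnonzero(mask)
print(result)
```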

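【Discussion】:

Thanks! However, I want the indices to be in the same order as the elements in a. This gives them in the same order as in b. Does that make sense?
Yes, but you should clarify your question, because that was not clear.

【Solution 2】:

Unless I am mistaken, your approach searches the entire array b for each element of a, again and again.

Alternatively, you could create a dictionary mapping the individual elements of b to their indices.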

indices = {}
for i, e in enumerate(b):
    indices[e] = i                        # if elements in b are unique
    # indices.setdefault(e, []).append(i) # otherwise, use this line instead to collect lists

Then you can use this mapping to quickly find the indices at which the elements of a can be found in b.

bInds = [ indices[c] for c in a ]
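
A vectorized variant of the same idea is to sort b once and binary-search it with numpy.searchsorted. This is a sketch that is not from the original answer. It assumes the elements of b are unique and that every element of a occurs in b; the small arrays are made up for illustration.

```python
import numpy

# Small stand-in arrays; b is assumed unique and to contain every element of a.
a = numpy.array([7, 3, 9])
b = numpy.array([3, 5, 7, 9, 1])

# Sort b once, remembering where each sorted value came from.
order = numpy.argsort(b)
sorted_b = b[order]

# Binary-search every element of a in the sorted copy of b.
pos = numpy.searchsorted(sorted_b, a)

# Map positions in the sorted copy back to positions in the original b.
bInds = order[pos]
print(bInds)
```

The result follows the order of a. This sketch does not detect elements of a that are missing from b; that would need an extra check, for example comparing b[bInds] against a.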

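【Discussion】:

I believe this is exactly what I need! I guess this is what you meant by hashmap? Thanks for your time.
Yes, hashmap is more or less another word for dictionary. In Python it is called a dictionary, or dict, or just {}; in Java it is a Map or HashMap. Sorry for the confusion.
Do you know how to add a failsafe for items from a that are not found in b?
I just created a separate set and used: [ indices[c] if c in b_set else -99 for c in a ]
You don't need a separate set. Lookup in a dict is also O(1), so you can just use indices[c] if c in indices else -99, or use get with a default value, i.e. indices.get(c, -99).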
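
Put together, the failsafe from the last comment might look like the sketch below. The small lists are made up for illustration, the elements of b are assumed to be unique, and -99 is the placeholder value used in the comments above.

```python
# Small stand-in data; the value 4 does not occur in b.
a = [7, 4, 3]
b = [3, 5, 7]

# Map each value of b to its index (values of b assumed unique).
indices = {e: i for i, e in enumerate(b)}

# dict.get returns the fallback when the key is missing,
# so no separate set is needed.
bInds = [indices.get(c, -99) for c in a]
print(bInds)
```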