Searching for value in an array of tuples is very slow

【Question title】: Searching for value in an array of tuples is very slow
【Posted】: 2015-05-15 20:15:32
【Question】:

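Consider the following task:

Generate an array A of large random numbers and sort it. Then generate a random number and check whether it exists in array A. Repeat. If it is found, return its original position in array A (before sorting) together with the value.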

Example: array A before sorting:

+-------+-------------------------------+
| index |  0  1  2  3  4  5  6  7  8  9 |
| value |  1  3  9 27 81 17 51 40  7 21 |
+-------+-------------------------------+

After sorting:

+-------+-------------------------------+
| index |  0  1  8  2  5  9  3  7  6  4 |
| value |  1  3  7  9 17 21 27 40 51 81 |
+-------+-------------------------------+

Does the number 21 exist in the array? Yes, at index 9!

I came up with the following solution:

from random import randint

def value_exists(needle, haystack):
    # finds if needle exists in haystack of tuples and returns it if so
    for item in haystack:
        if item[1] > needle:
            return None
        if item[1] == needle:
            return item

n = 200000
size = 100000000

# fill array A with random numbers
arrayA = [1]
for i in range(1, n):
    arrayA.append(randint(0, size))
arrayA = enumerate(arrayA)
# sort them by values keeping its indexes
arrayA = sorted(arrayA, key=lambda x: x[1])

# search
for i in range(1, n):
    value = randint(0, size)
    check = value_exists(value, arrayA)
    if check:
        break

if check:
    print(check)

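This solution works, but it is extremely slow. With size set to 100,000,000 it takes about 30 seconds. With 10,000,000,000 I can't even get a result (> 5 minutes).

I don't see why this task should take so long. I know the numbers are large, but they fit into a 64-bit integer. I found that the value_exists function is the core of the problem. Can it be improved?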

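【Question discussion】:

Tags: python list sorting python-3.x

【Solution 1】:

Instead of using an array, why not use a dictionary? You can store each random number as a key and its index as the value.

Then, to check whether a random number is in the collection, just use in.

Example: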

import random

# Create a large list of random numbers
A_list = random.sample(range(100000, 999999), 10000)  # range, not xrange, in Python 3

# EDIT: Forgot to sort the array!
A_list = sorted(A_list)

# Load the numbers in a dictionary
A_dict = {}
for idx, num in enumerate(A_list):
    A_dict[num] = idx

# Now, check if a number exists
if 101337 in A_dict:
    # it exists!
    # Get its index (print rather than return, since this is not inside a function)
    print(A_dict[101337])

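【Discussion】:

I'm not tied to using arrays. By "array" I meant an array in the general sense. I'll look into this solution in a second.
@thefourtheye If you noticed, I do generate a large number of random numbers. I just load them into a dictionary to make searching easier. My solution still returns the index of the number in the list.
@nivixzixer I wanted to suggest this too, but then I thought the OP wanted to do this with lists only. Also, I would have gone with a dictionary comprehension.
@nivixzixer No, after sorting you lose the actual indices. For a dictionary-based solution, sorting isn't needed at all.
Ah... so he wants to find the index of a number in the original list? Before sorting? I didn't get that, sorry.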
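
The point raised in the comments above (build the dictionary from the unsorted list so that the stored index is the original position) can be sketched roughly as follows. This is only an illustration: build_index and lookup are hypothetical helper names, the sample array is made up for the demo, and keeping the first index for duplicate values is an assumption, since the question does not say how duplicates should be handled.

```python
def build_index(values):
    """Map each value to its first position in the original (unsorted) list."""
    index_of = {}
    for idx, num in enumerate(values):
        # setdefault keeps the first index when a value appears more than once
        index_of.setdefault(num, idx)
    return index_of


def lookup(needle, index_of):
    """Return (original_index, value) if needle is present, else None."""
    idx = index_of.get(needle)
    return None if idx is None else (idx, needle)


# Small illustrative array (not sorted); 21 sits at original index 9
A = [1, 3, 9, 27, 81, 17, 51, 40, 7, 21]
index_of = build_index(A)

print(lookup(21, index_of))  # (9, 21)
print(lookup(22, index_of))  # None
```

Each lookup is a single hash-table probe on average, so the cost of a search no longer grows with the length of the list.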

【Solution 2】:

First, as a more efficient approach, you can use a generator expression in the value_exists function; you also don't need to check item[1] > needle:

def value_exists(needle, haystack):
    # the generator expression needs its own parentheses when next() is given a default
    return next((item for item in haystack if item[1] == needle), None)

You can use random.sample to create a random list. For example:

>>> random.sample(range(100),10)
[87, 24, 71, 64, 86, 11, 59, 54, 20, 92]

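For the last part you can also use a generator expression:

# filter(None, ...) skips the misses, so next() returns the first actual hit (or None)
next(filter(None, (value_exists(randint(0, size), arrayA) for i in range(1, n))), None)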

If sorting the array is necessary, you can use operator.itemgetter() as your key, which is more efficient for long lists:

from operator import itemgetter
arrayA = sorted(arrayA, key=itemgetter(1))
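
If the list is kept sorted by value anyway, a binary search is another option that neither answer here spells out. Below is a minimal sketch using the standard-library bisect module. The helper name find_sorted, the parallel values list, and the fixed seed are assumptions made for this illustration, not part of the original code.

```python
from bisect import bisect_left
from operator import itemgetter
from random import randint, seed


def find_sorted(needle, pairs, values):
    """Binary-search the sorted values; return the matching (index, value) pair or None."""
    pos = bisect_left(values, needle)
    if pos < len(values) and values[pos] == needle:
        return pairs[pos]
    return None


seed(0)  # fixed seed only so that the sketch is repeatable
n, size = 200000, 100000000
original = [randint(0, size) for _ in range(n)]

# (original_index, value) pairs sorted by value, as in the question
pairs = sorted(enumerate(original), key=itemgetter(1))
# parallel list holding just the values, so bisect compares plain integers
values = [value for _, value in pairs]

hit = None
for _ in range(n):
    hit = find_sorted(randint(0, size), pairs, values)
    if hit:
        break

if hit:
    print(hit)
```

Each search then takes on the order of log2(n) comparisons (about 18 for n = 200000) instead of a scan through the list, which is where the time goes in the original value_exists.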

【Discussion】: