How to efficiently count how many intervals contain each number in a set

【Question title】: How to count the presence of a set of numbers in a set of intervals efficiently
【Posted】: 2023-03-29 16:13:02
【Question description】:

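The input is a list of tuples representing intervals and a list of integers. The goal is to write a function that counts, for each integer, the number of intervals that contain it, and returns the result as an associative array. For example: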

Input intervals: [(1, 3), (5, 6), (6, 9)]
Input integers: [2, 4, 6, 8]
Output: {2: 1, 4: 0, 6: 2, 8: 1}

Another example:

Input intervals: [(3, 3), (22, 30), (17, 29), (7, 12), (12, 34), (18, 38), (30, 40), (5, 27), (19, 26), (27, 27), (1, 31), (17, 17), (22, 25), (6, 14), (5, 7), (9, 19), (24, 28), (19, 40), (9, 36), (2, 32)]
Input numbers: [16, 18, 39, 40, 27, 28, 4, 23, 15, 24, 2, 6, 32, 17, 21, 29, 31, 7, 20, 10]
Output: {2: 2, 4: 2, 6: 5, 7: 6, 10: 7, 15: 6, 16: 6, 17: 8, 18: 8, 20: 9, 21: 9, 23: 11, 24: 12, 27: 11, 28: 9, 29: 8, 31: 7, 32: 6, 39: 2, 40: 2}

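How would I write a function that does this efficiently? I already have an O(nm) implementation, where n is the number of intervals and m is the number of integers, but I am looking for something more efficient.

What I have now: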

def intervals_per_number(numbers, intervals):
    result_map = {i: 0 for i in numbers}
    for i in result_map.keys():
        for k in intervals:
            if k[0] <= i <= k[1]:
                result_map[i] += 1
    return result_map

I hope I explained this well enough. If anything is still unclear, please let me know.

Thanks in advance.

【Question comments】:

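Only if the lists are sorted: adapt a binary search. It won't give much of a benefit on such small samples, but it should on lengths of 8 and larger (8, because that takes at most 3 lookups; for larger lengths, up to a million, it will only go up exponentially).
@usr2564301 How do you get from exponential to something better than the quadratic solution? That doesn't sound right.
@HeapOverflow: (oops) yes, quadratic is better than linear. ("Exponential" only in powers of 2, not n. But still much better.)
@usr2564301 Hmm, that still doesn't sound right. Isn't it O(n log n) instead of O(n^2), so better by a factor of O(n / log(n))? That is, better by less than linear?
@HeapOverflow No, the intervals and integers are not always already sorted.

Tags: python list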

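【Solution 1】:

Put your integers, start points and end points into a single list of pairs. Make the first element of each pair the value of the integer, start point or end point, and the second element 0, -1 or 1, depending on whether it is an integer, a start point or an end point.

Next, sort the list.

Now you can walk through the list, maintaining a running sum of the second elements of the pairs. When you see a pair whose second element is 0, record the running sum (negated) for that integer.

In the worst case this runs in O((N+M)log(N+M)) time (in practice I guess it will be close to linear if the queries and intervals are mostly sorted, thanks to timsort).

For example: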

Input intervals: [(1, 3), (5, 6), (6, 9)]
Input integers: [2, 4, 6, 8]

Unified list (sorted):
[(1,-1), (2,0), (3,1), (4,0), (5,-1), (6, -1), (6,0), (6,1), (8,0), (9,1)]

Running sum:
[-1    , -1,    0,     0,      -1,    -2,      -2,     -1,    -1,   0]

Values for integers:
2: 1, 4: 0, 6: 2, 8: 1

Sample code:

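def query(qs, intervals):
    # Tag each value: 0 = query, -1 = interval start, 1 = interval end.
    # At equal values the sort puts starts first, then queries, then ends,
    # so both endpoints of an interval count as inside it.
    xs = [(q, 0) for q in qs] + [(x, -1) for x, _ in intervals] + [(x, 1) for _, x in intervals]
    S, r = 0, dict()  # S = number of intervals currently open
    for v, s in sorted(xs):
        if s == 0:
            r[v] = S
        S -= s  # a start (-1) raises S, an end (1) lowers it
    return r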

intervals = [(3, 3), (22, 30), (17, 29), (7, 12), (12, 34), (18, 38), (30, 40), (5, 27), (19, 26), (27, 27), (1, 31), (17, 17), (22, 25), (6, 14), (5, 7), (9, 19), (24, 28), (19, 40), (9, 36), (2, 32)]
queries = [16, 18, 39, 40, 27, 28, 4, 23, 15, 24, 2, 6, 32, 17, 21, 29, 31, 7, 20, 10]
print(query(queries, intervals))

Output:

{2: 2, 4: 2, 6: 5, 7: 6, 10: 7, 15: 6, 16: 6, 17: 8, 18: 8, 20: 9, 21: 9, 23: 11, 24: 12, 27: 11, 28: 9, 29: 8, 31: 7, 32: 6, 39: 2, 40: 2}

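【Discussion】:

【Solution 2】:

You can presort the integers and then use bisect_left on the lower bound of each interval. Sorting has a complexity of O(M*log(M)), while each bisect has a complexity of O(log(M)). So effectively you have O(max(M, N) * log(M)).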

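import bisect
from collections import defaultdict

# `integers` and `intervals` are the two input lists from the question.
result = defaultdict(int)
integers = sorted(integers)
for low, high in intervals:
    # First position whose value is >= low.
    index = bisect.bisect_left(integers, low)
    # Walk forward over every integer that lies inside [low, high].
    while index < len(integers) and integers[index] <= high:
        result[integers[index]] += 1
        index += 1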
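
The inner while loop above can visit many integers per interval when the intervals overlap heavily. One possible variant, not part of the original answer, sorts the interval starts and ends separately and answers each query with two binary searches: the number of starts at or before the query minus the number of ends strictly before it. This is only a sketch; the function name count_containing is hypothetical, and it assumes closed intervals with low <= high.

```python
from bisect import bisect_left, bisect_right

def count_containing(numbers, intervals):
    # Assumption: every interval is closed and satisfies low <= high.
    starts = sorted(low for low, _ in intervals)
    ends = sorted(high for _, high in intervals)
    result = {}
    for q in numbers:
        opened = bisect_right(starts, q)  # intervals with low <= q
        closed = bisect_left(ends, q)     # intervals with high < q
        result[q] = opened - closed
    return result

print(count_containing([2, 4, 6, 8], [(1, 3), (5, 6), (6, 9)]))
```

Under these assumptions the cost is two sorts of length N plus two binary searches per query, so roughly O((N + M) log N) regardless of how much the intervals overlap.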

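【Discussion】:

It looks like your complexity analysis forgot the while loop.
@HeapOverflow True, but that depends on the distribution of the intervals. If they are mostly non-overlapping (or the amount of overlap is constant), then the loop only adds constant time.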

【Solution 3】:

Depending on the use case and context, something simple may be enough:

from collections import Counter
from itertools import chain

counts = Counter(chain.from_iterable(range(f, t+1) for f,t in input_intervals))
result = {k:counts[k] for k in input_numbers}

This is O(n*k + m), where n is the number of intervals, k is the average size of an interval, and m is the number of integers.
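
If the interval sizes are large but the overall range of values is moderate, a difference array may avoid the factor k. This is a different technique from the Counter approach above and is only a sketch: it marks +1 at each start and -1 just past each end, then takes a prefix sum, for roughly O(n + R + m) where R is the span from the smallest start to the largest end. The function name count_by_difference_array is hypothetical, and the code assumes closed integer intervals with low <= high.

```python
from itertools import accumulate

def count_by_difference_array(numbers, intervals):
    if not intervals:
        return {q: 0 for q in numbers}
    lo_min = min(low for low, _ in intervals)
    hi_max = max(high for _, high in intervals)
    # diff[i] holds the change in coverage when moving to value lo_min + i.
    diff = [0] * (hi_max - lo_min + 2)
    for low, high in intervals:
        diff[low - lo_min] += 1
        diff[high - lo_min + 1] -= 1
    # Prefix sums turn the changes into the coverage count at each value.
    coverage = list(accumulate(diff))
    return {
        q: coverage[q - lo_min] if lo_min <= q <= hi_max else 0
        for q in numbers
    }

print(count_by_difference_array([2, 4, 6, 8], [(1, 3), (5, 6), (6, 9)]))
```

The memory use grows with R, so this only pays off when the values are integers in a reasonably small range.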

【Discussion】: