Intersection complexity: answers

【Question title】: Intersection complexity
【Posted】: 2011-12-27 11:57:26
【Question】:

In Python you can get the intersection of two sets like this:

>>> s1 = {1, 2, 3, 4, 5, 6, 7, 8, 9}
>>> s2 = {0, 3, 5, 6, 10}
>>> s1 & s2
set([3, 5, 6])
>>> s1.intersection(s2)
set([3, 5, 6])

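Does anybody know the complexity of this intersection (&) algorithm?

EDIT: In addition, does anyone know what data structure is behind a Python set?

【Question comments】:

Tags: python set complexity-theory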

【Answer 1】:

The intersection algorithm always runs in O(min(len(s1), len(s2))) time.

In pure Python, it looks like this:

    def intersection(self, other):
        if len(self) <= len(other):
            little, big = self, other
        else:
            little, big = other, self
        result = set()
        for elem in little:
            if elem in big:
                result.add(elem)
        return result

[In answer to the question in the additional edit] The data structure behind sets is a hash table.
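
To make the O(min(len(s1), len(s2))) claim concrete, here is a minimal sketch that counts the membership tests performed. The function name `intersection_with_probe_count` is made up for illustration; it mirrors the pure-Python version above rather than CPython's actual C code.

```python
def intersection_with_probe_count(a, b):
    """Return (a & b, number of membership tests performed)."""
    # Always iterate over the smaller set and probe the larger one.
    little, big = (a, b) if len(a) <= len(b) else (b, a)
    result = set()
    probes = 0
    for elem in little:
        probes += 1
        if elem in big:  # expected O(1) hash-table lookup
            result.add(elem)
    return result, probes


small = {0, 3, 5, 6, 10}
large = set(range(100_000))
common, probes = intersection_with_probe_count(large, small)
print(sorted(common), probes)  # [0, 3, 5, 6, 10] 5
```

The number of probes depends only on the smaller set, no matter how large the other one is.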

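【Comments】:

According to the wiki I linked above, the worst case for elem in big in your code is O(n) (although the average is of course O(1)). That is the basis for the O(len(s)*len(t)) worst case for intersection. Any idea why?
The "worst case" assumes data that is ill-suited to the hash tables used by dict and set. The data would have to be such that every item has exactly the same hash value, which would force the hash table to do something like a linear search to perform the __contains__ check. IOW, I wouldn't worry about this at all. Set intersection is very fast; it even reuses the internally stored hash values, so no calls to hash() are needed.
Link to the 3.x code: here. It is for 3.9.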
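
The degenerate case described in these comments can be provoked on purpose. The sketch below uses a hypothetical `BadHash` class (not from the thread) whose instances all share one hash value, and counts how many equality comparisons a single intersection triggers. The exact count is an implementation detail, so none is shown; it should grow roughly with len(s) * len(t).

```python
class BadHash:
    """Hypothetical key type whose instances all share one hash value."""

    eq_calls = 0  # class-wide counter of equality comparisons

    def __init__(self, value):
        self.value = value

    def __hash__(self):
        return 1  # every instance collides with every other

    def __eq__(self, other):
        BadHash.eq_calls += 1
        return isinstance(other, BadHash) and self.value == other.value


n = 200
s = {BadHash(i) for i in range(n)}
t = {BadHash(i) for i in range(n, 2 * n)}  # disjoint from s

BadHash.eq_calls = 0  # ignore comparisons made while building the sets
common = s & t
print(len(common), BadHash.eq_calls)  # 0 matches, but far more than n comparisons
```

With ordinary keys such as ints or strings, hashes are spread out and each lookup needs only a handful of comparisons at most.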

【Answer 2】:

The answer appears to be a search engine query away. You can also use this direct link to the Time Complexity page at python.org. Quick summary:

Average:     O(min(len(s), len(t)))
Worst case:  O(len(s) * len(t))

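EDIT: As Raymond points out below, the "worst case" scenario isn't likely to occur. I included it originally to be thorough, and I'm leaving it to provide context for the discussion below, but I think Raymond is right.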

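【Comments】:

That's the worst case, isn't it?
It doesn't look like it sorts first (which would require the objects to have an ordering) but rather just does "hash probes", perhaps for a better constant factor C and average case (and no ordering requirement). AFAIK the maximum complexity required is about O(n log n) + O(n), with a sort. However, Big-O is an upper bound, and there are practical considerations, so...
I think the main problem here is that a set is an unordered collection. In C++, you can build an intersection in at most 2*(L1+L2)-1 comparisons (with two sorted lists). That is a damn good complexity! cplusplus.com/reference/algorithm/set_intersection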
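
For comparison, the merge-style intersection of two already sorted sequences that this comment refers to can be sketched in Python as follows. `sorted_intersection` is an illustrative name; this is not how Python's set type works, and it assumes both inputs are sorted ascending and contain no duplicates.

```python
def sorted_intersection(xs, ys):
    """Intersect two ascending-sorted lists in one linear pass."""
    i = j = 0
    out = []
    while i < len(xs) and j < len(ys):
        if xs[i] < ys[j]:
            i += 1          # xs[i] cannot appear later in ys
        elif ys[j] < xs[i]:
            j += 1          # ys[j] cannot appear later in xs
        else:
            out.append(xs[i])
            i += 1
            j += 1
    return out


a = [1, 2, 3, 4, 5, 6, 7, 8, 9]
b = [0, 3, 5, 6, 10]
print(sorted_intersection(a, b))  # [3, 5, 6]
```

Each loop iteration advances at least one index, so the pass is linear in len(xs) + len(ys), but it only applies when the elements are orderable and already sorted.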
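This answer is somewhat misleading with respect to the "worst case" time. Don't let it scare you away from a perfectly good algorithm.
@user124384 Funnily enough, the first search result behind that hyperlink is this thread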

【Answer 3】:

The intersection of two sets of sizes m and n can be computed in O(max{m,n} * log(min{m,n})) as follows. Assume m << n:

1. Represent the two sets as lists/arrays (something sortable)
2. Sort the **smaller** list/array (cost: m*logm)
3. Do until all elements in the bigger list have been checked:
    3.1 Sort the next **m** items on the bigger list(cost: m*logm)
    3.2 With a single pass compare the smaller list and the m items you just sorted and take the ones that appear in both of them(cost: m)
4. Return the new set

The loop in step 3 runs for n/m iterations and each iteration takes O(m*logm), so for m << n you get a time complexity of O(nlogm).
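
The steps above can be sketched in Python roughly like this. The function name `chunked_sort_intersection` is made up for illustration, and the sketch assumes the elements are mutually orderable (which plain set intersection does not require).

```python
def chunked_sort_intersection(s, t):
    """Intersect two sets by sorting the smaller one and m-sized chunks of the larger."""
    small, big = (s, t) if len(s) <= len(t) else (t, s)
    small_sorted = sorted(small)  # step 2: O(m log m)
    big_list = list(big)          # step 1: a sliceable representation
    m = len(small_sorted)
    result = set()
    if m == 0:
        return result
    for start in range(0, len(big_list), m):       # about n/m iterations
        chunk = sorted(big_list[start:start + m])  # step 3.1: O(m log m)
        i = j = 0
        while i < m and j < len(chunk):            # step 3.2: single merge pass
            if small_sorted[i] < chunk[j]:
                i += 1
            elif chunk[j] < small_sorted[i]:
                j += 1
            else:
                result.add(chunk[j])
                i += 1
                j += 1
    return result                                  # step 4


s1 = {1, 2, 3, 4, 5, 6, 7, 8, 9}
s2 = {0, 3, 5, 6, 10}
print(sorted(chunked_sort_intersection(s1, s2)))  # [3, 5, 6]
```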

I think this is the best lower bound that exists.

【Comments】: