Answers: Finding the median of more than 20 million of 3 to 4 different integers in 1.5 seconds

【Question title】: Finding Median of more than 20 Million of 3 to 4 different integers in 1.5 seconds
【Posted】: 2018-08-28 16:10:26
【Question description】:

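I am trying to sort and find the median of a string of integers that contains only 3 to 4 different integers.

The amount of numbers I am dealing with is about 20 to 25 million. I am supposed to sort the vector and find the median each time a new integer is added to the vector, then add that median to a separate "Total" variable, which sums up all the medians generated so far.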

1                   Median: 1              Total: 1
1 , 2               Median: (1+2)/2 = 1    Total: 1 + 1 = 2
1 , 2 , 3           Median: 2              Total: 2 + 2 = 4
1 , 1 , 2 , 3       Median: (1+2)/2 = 1    Total: 4 + 1 = 5
1 , 1 , 1 , 2 , 3   Median: 1              Total: 5 + 1 = 6
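
The table above can be reproduced with a slow reference routine, which is handy for checking faster versions. This is only a sketch: the function name `naiveRunningMedianTotal` is hypothetical, the insertion order 1, 2, 3, 1, 1 is inferred from the table, and the even-count median is assumed to be the truncated integer mean, as "(1+2)/2 = 1" suggests.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical reference helper: keeps the prefix sorted and sums the medians.
// Assumption: for an even count the median is the truncated mean of the two
// middle values, matching "(1+2)/2 = 1" in the example table.
long long naiveRunningMedianTotal(const std::vector<int>& input)
{
  std::vector<int> sorted;
  sorted.reserve(input.size());
  long long total = 0;
  for (int value : input)
  {
    // Insert at the sorted position: O(n) per insertion, O(n^2) overall.
    sorted.insert(std::upper_bound(sorted.begin(), sorted.end(), value), value);
    const std::size_t n = sorted.size();
    const long long lower = sorted[(n - 1) / 2]; // equals upper when n is odd
    const long long upper = sorted[n / 2];
    total += (lower + upper) / 2;
  }
  return total;
}
```

Called on the sequence 1, 2, 3, 1, 1 it returns 6, the final Total in the table. It is far too slow for 20 million elements and is meant only as a correctness baseline.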

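I am trying to find a way to optimize my code further because it is not efficient enough. (It has to finish in around 2 seconds.) Does anyone have an idea how to speed up my code logic further?

I am currently using 2 heaps, or priority queues, in C++. One works as a max-heap and the other as a min-heap.

Got the idea from Data structure to find median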

You can use 2 heaps, that we will call Left and Right.
Left is a Max-Heap.
Right is a Min-Heap.
Insertion is done like this:

If the new element x is smaller than the root of Left then we insert x to 
Left.
Else we insert x to Right.
If after insertion Left has count of elements that is greater than 1 from 
the count of elements of Right, then we call Extract-Max on Left and insert 
it to Right.
Else if after insertion Right has count of elements that is greater than the 
count of elements of Left, then we call Extract-Min on Right and insert it 
to Left.
The median is always the root of Left.

So insertion is done in O(lg n) time and getting the median is done in O(1) 
time.
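
A minimal C++ sketch of the quoted Left/Right scheme, using `std::priority_queue`. The class name `TwoHeapMedian` and its member names are hypothetical. One deliberate deviation is labeled in the code: for an even count it returns the truncated mean of both roots, to match the example table, whereas the quoted text always returns the root of Left.

```cpp
#include <cstddef>
#include <functional>
#include <queue>
#include <vector>

// Hypothetical class name; the logic follows the quoted Left/Right steps.
class TwoHeapMedian
{
public:
  void insert(int x)
  {
    // Left is empty only before the very first insertion.
    if (left_.empty() || x < left_.top())
      left_.push(x);
    else
      right_.push(x);

    // Rebalance so that left_.size() is right_.size() or right_.size() + 1.
    if (left_.size() > right_.size() + 1)
    {
      right_.push(left_.top());
      left_.pop();
    }
    else if (right_.size() > left_.size())
    {
      left_.push(right_.top());
      right_.pop();
    }
  }

  // Precondition: at least one element was inserted.
  // Assumption: for an even count, return the truncated mean of both roots
  // (as in the example table); the quoted text returns the root of Left.
  int median() const
  {
    if (left_.size() == right_.size())
      return static_cast<int>((static_cast<long long>(left_.top()) + right_.top()) / 2);
    return left_.top();
  }

private:
  std::priority_queue<int> left_;                                       // max-heap
  std::priority_queue<int, std::vector<int>, std::greater<int>> right_; // min-heap
};
```

Inserting 1, 2, 3, 1, 1 and summing `median()` after each insertion gives 6, as in the table. Each insertion costs O(log n), so the whole run is O(n log n).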

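But it is still not fast enough...

【Discussion】:

If the number of distinct elements is very small, you can just count how many times each element occurs, in linear time, and compute the median from those counts.
There is no limit on how often each number appears, as long as the counts add up to at most about 25 million. But only 3 to 4 different numbers will appear.
That is not what he meant. If you know a priori what the numbers are, it is simple to count how many times each distinct number occurs in its own bucket and then determine the median from the number of items in all the buckets.
Why is it not fast enough oO In general, from an algorithmic point of view, it is as fast as O(ln n) and O(1), and I don't think it can be faster. But in the special case Max Langhof mentioned, you have only very few different numbers. By counting the number of occurrences you can get O(1) for input and O(1) for the median.
@VageEgiazarian You can't traverse the input in O(1) time, and the original algorithm isn't O(ln N) either.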

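Tags: c++ performance data-structures big-o median

【Solution 1】:

If there are only three to four different integers in the string, you can keep track of how many times each one occurs by traversing the string once. Adding elements to (and removing them from) this representation can also be done in constant time.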

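#include <cstddef>
#include <map>
#include <vector>

class MedianFinder
{
public:
  explicit MedianFinder(const std::vector<int>& inputString)
  {
    for (int element : inputString)
      _counts[element]++; // operator[] inserts a zero count if element is not in the map yet.
  }

  void addStringEntry(int entry)
  {
    _counts[entry]++;
  }

  // Precondition: at least one element has been added.
  // Assumption: for an even count, the median is the truncated integer mean of
  // the two middle values, as in the question's example ((1+2)/2 = 1).
  int getMedian() const
  {
    std::size_t numberOfElements = 0;
    for (const auto& kvp : _counts)
      numberOfElements += kvp.second;

    // 0-based ranks of the two middle elements (identical for an odd count).
    const std::size_t lowerRank = (numberOfElements - 1) / 2;
    const std::size_t upperRank = numberOfElements / 2;

    std::size_t cumulativeCount = 0;
    long long lowerValue = 0;
    bool lowerFound = false;
    for (const auto& kvp : _counts) // std::map iterates in ascending key order
    {
      cumulativeCount += kvp.second;
      if (!lowerFound && cumulativeCount > lowerRank)
      {
        lowerValue = kvp.first;
        lowerFound = true;
      }
      // The median may lie between two buckets: average both middle values.
      if (cumulativeCount > upperRank)
        return static_cast<int>((lowerValue + kvp.first) / 2);
    }
    return static_cast<int>(lowerValue); // only reached when no element was added
  }

private:
  std::map<int, std::size_t> _counts;
};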

The trivial task of summing up the medians is not shown here.
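
For completeness, here is one way the summing step might look when combined with the counting idea. It is a sketch, not the answer author's code: the function name `countingRunningMedianTotal` is hypothetical, and the even-count median is assumed to be the truncated integer mean used in the question's example.

```cpp
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical helper: sums the running medians using per-value counts.
long long countingRunningMedianTotal(const std::vector<int>& input)
{
  std::map<int, std::size_t> counts; // at most 3 to 4 keys, per the question
  std::size_t size = 0;
  long long total = 0;

  for (int value : input)
  {
    ++counts[value];
    ++size;

    // 0-based ranks of the two middle elements (identical when size is odd).
    const std::size_t lowerRank = (size - 1) / 2;
    const std::size_t upperRank = size / 2;

    std::size_t cumulative = 0;
    long long lower = 0;
    bool lowerFound = false;
    for (const auto& bucket : counts) // ascending key order
    {
      cumulative += bucket.second;
      if (!lowerFound && cumulative > lowerRank)
      {
        lower = bucket.first;
        lowerFound = true;
      }
      if (cumulative > upperRank)
      {
        // Assumption: truncated mean of the two middle values.
        total += (lower + bucket.first) / 2;
        break;
      }
    }
  }
  return total;
}
```

Because the inner loop visits at most as many buckets as there are distinct values, each insertion costs a small constant amount of work and the whole run is linear in the input length. For the sequence 1, 2, 3, 1, 1 the function returns 6.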

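【Discussion】:

【Solution 2】:

I would focus less on micro-optimizing and more on reducing the complexity from O(n * log n) to O(n).

Your algorithm is O(n * log n) because you perform n insertions, each costing amortized O(log n) time.

There is a well-known O(n) algorithm for median finding. I suggest using this.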
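
One readily available selection routine in C++ is `std::nth_element`, which runs in linear time on average. Whether it is the same algorithm the answer links to is not stated here, so treat the following as an illustration of selection-based median finding in general. The function name `medianBySelection` is hypothetical, and the even-count median is again assumed to be the truncated integer mean from the question's example.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper. Takes the vector by value because std::nth_element
// reorders its input. Precondition: values is not empty.
int medianBySelection(std::vector<int> values)
{
  const std::size_t n = values.size();
  const auto upperIt = values.begin() + static_cast<std::ptrdiff_t>(n / 2);

  // After this call *upperIt is the element that would be at index n/2 in
  // sorted order, and everything before it is not greater than it.
  std::nth_element(values.begin(), upperIt, values.end());
  const long long upper = *upperIt;
  if (n % 2 == 1)
    return static_cast<int>(upper);

  // Even count: the lower middle value is the largest element left of upperIt.
  const long long lower = *std::max_element(values.begin(), upperIt);
  return static_cast<int>((lower + upper) / 2);
}
```

This finds a single median of the whole vector. Calling it again after every insertion would cost O(n) per call, so it does not by itself solve the running-median task in the question.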

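Usually log n is no big deal, but for 20 million elements it can make your algorithm about 25 times faster (log2 of 20,000,000 is roughly 24).

Oh, my bad. I didn't realize there are only 3-4 different integers...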

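【Discussion】:

While this is the best solution for the general case, it is still unnecessarily laborious given the constraints of the OP's specific problem.
This is the best solution only if we ask for the median rarely and insert often. In general, when insertions and median queries are about equally frequent, it will be much slower.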