C++: How to merge sorted vectors into a sorted vector / pop the least element from all of them? (Answers)

【Question title】: C++ How to merge sorted vectors into a sorted vector / pop the least element from all of them?
【Posted】: 2012-02-19 06:33:01
【Question】:

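I have a collection of about a hundred sorted vector<int>s. Most of them contain only a few integers, but some contain a large number (>10K), so the vectors do not necessarily have the same size.

What I'd like to do is essentially iterate, from smallest to largest, through the integers contained in all of these sorted vectors.

One way to do it would be to merge all of these sorted vectors into one sorted vector and simply iterate over it. Thus,

Question 1: What is the fastest way to merge sorted vectors into a sorted vector?

On the other hand, I'm sure there are faster/cleverer ways to accomplish this without merging and re-sorting the whole thing, perhaps by repeatedly popping the smallest integer from this collection of sorted vectors, without merging them first. So:

Question 2: What is the fastest/best way to pop the least element from a bunch of sorted vector<int>s?

Based on the replies below and the comments on the question, I've implemented an approach in which I build a priority queue of iterators into the sorted vectors. I'm not sure how efficient it is performance-wise, but it seems to be very memory-efficient. I consider the question still open, since I'm not sure we've established the fastest way yet.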

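#include <iostream>
#include <queue>
#include <utility>
#include <vector>
using namespace std;

// An entry is a (current, end) pair of iterators into one sorted vector.
// Entries compare by the integer the "current" iterator points to; using ">"
// turns priority_queue (a max-heap by default) into a min-heap.
struct cmp_seeds {
    bool operator () (const pair< vector<int>::iterator, vector<int>::iterator>& p1, const pair< vector<int>::iterator, vector<int>::iterator>& p2) const {
        return *(p1.first) >  *(p2.first);
    }
};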

int pq_heapsort_trial() {

    /* Set up the Sorted Vectors */ 
    int a1[] = { 2, 10, 100};
    int a2[] = { 5, 15, 90, 200};
    int a3[] = { 12 };

    vector<int> v1 (a1, a1 + sizeof(a1) / sizeof(int));
    vector<int> v2 (a2, a2 + sizeof(a2) / sizeof(int));
    vector<int> v3 (a3, a3 + sizeof(a3) / sizeof(int));

    vector< vector <int> * > sorted_vectors;
    sorted_vectors.push_back(&v1);
    sorted_vectors.push_back(&v2);
    sorted_vectors.push_back(&v3);
    /* the above simulates the "for" i have in my own code that gives me sorted vectors */

    pair< vector<int>::iterator, vector<int>::iterator> c_lead;
    cmp_seeds mycompare;

    priority_queue< pair< vector<int>::iterator, vector<int>::iterator>, vector<pair< vector<int>::iterator, vector<int>::iterator> >, cmp_seeds> cluster_feeder(mycompare);


    for (vector<vector <int> *>::iterator k = sorted_vectors.begin(); k != sorted_vectors.end(); ++k) {
        // skip empty vectors: their begin() == end() must never be dereferenced
        if (!(*k)->empty()) {
            cluster_feeder.push( make_pair( (*k)->begin(), (*k)->end() ));
        }
    }


    while ( cluster_feeder.empty() != true) {
        c_lead = cluster_feeder.top();
        cluster_feeder.pop();
        // sorted output
        cout << *(c_lead.first) << endl;

        c_lead.first++;
        if (c_lead.first != c_lead.second) {
            cluster_feeder.push(c_lead);
        }
    }

    return 0;
}

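【Comments】:

1) If space is not an issue, do the standard CS101 merge of sorted ranges into a new vector (or just think about it for a moment and do the obvious thing). 2) Before you go popping things all over the place, make sure you understand the complexity guarantees of the standard containers; modifying a std::vector is generally fairly expensive. 3) Stop misusing apo'strophes!
@Kerrek-SB Thanks, fixed the formatting a bit. I'm happy to simply merge the vectors into one bigger vector and sort it, but I'd like to know whether there are faster ways to do this.
No no, you perform a sorted merge. Think about it: there's an obvious way to exploit the ordering of the input ranges to create an already-sorted output range.
@Kerrek-SB I think I see what you mean. I know how to use the regular merge function on two sorted vectors; does this work well recursively/iteratively? How do you do a "multi-merge" of more than 2 sorted vectors?
Use a priority queue (heap) to store the first elements of the vectors.

Tags: c++ sorting vector mergesort processing-efficiency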

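【Answer 1】:

One thing that comes to mind is to build a heap containing iterators into each vector, ordered by the value they currently point at. (Each entry would also need to carry the end iterator, of course.)

The current element is at the root of the heap; to advance, you either pop it or increase its key. (The latter can be done by popping, incrementing, then pushing.)

I believe this has asymptotic complexity O(E log M), where E is the total number of elements and M is the number of vectors.

If you really are popping everything out of the vectors, you could build a heap of pointers to the vectors; you may want to treat the vectors themselves as heaps too, to avoid the performance penalty of erasing from the front of a vector. (Or you could copy everything into deques first.)

Merging them together a pair at a time has the same asymptotic complexity if you are careful about the order. If you arrange all the vectors in a full, balanced binary tree and merge pairwise as you go up the tree, each element is copied log M times, which also gives an O(E log M) algorithm.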
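To illustrate the balanced pairwise idea, here is a minimal sketch built on std::merge. The function name merge_pairwise and the choice to take the inputs by value are assumptions made for this example, not part of the answer; each pass of the while loop corresponds to one level of the binary tree.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <utility>
#include <vector>

// Hypothetical helper: merges sorted vectors pairwise, round by round,
// until a single sorted vector remains.
std::vector<int> merge_pairwise(std::vector<std::vector<int>> runs) {
    if (runs.empty()) return {};
    while (runs.size() > 1) {
        std::vector<std::vector<int>> next;
        next.reserve((runs.size() + 1) / 2);
        for (std::size_t i = 0; i + 1 < runs.size(); i += 2) {
            std::vector<int> merged;
            merged.reserve(runs[i].size() + runs[i + 1].size());
            std::merge(runs[i].begin(), runs[i].end(),
                       runs[i + 1].begin(), runs[i + 1].end(),
                       std::back_inserter(merged));
            next.push_back(std::move(merged));
        }
        // With an odd number of runs the last one has no partner this round;
        // carry it over unchanged.
        if (runs.size() % 2 == 1) next.push_back(std::move(runs.back()));
        runs = std::move(next);
    }
    return std::move(runs.front());
}
```

Calling merge_pairwise with the three sample vectors from the question should yield a single vector containing all eight values in ascending order.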

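For extra practical efficiency, instead of the tree, you should repeatedly merge the two smallest vectors until only one is left. (Again, putting pointers to the vectors in a heap is the way to go, but this time ordered by length.)

(Really, you want to order by "cost to copy" rather than length. That is an extra thing to optimize for certain value types.)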
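The "two smallest first" strategy could be sketched as follows. This is only one possible reading of the answer: it keeps indices rather than pointers in the heap, orders by length rather than by copy cost, and the name merge_smallest_first is made up for the example.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <iterator>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical helper: always merges the two currently shortest runs.
std::vector<int> merge_smallest_first(std::vector<std::vector<int>> runs) {
    if (runs.empty()) return {};
    // Min-heap of (length, index into runs): the shortest run is on top.
    typedef std::pair<std::size_t, std::size_t> entry;
    std::priority_queue<entry, std::vector<entry>, std::greater<entry>> by_length;
    for (std::size_t i = 0; i < runs.size(); ++i)
        by_length.push(entry(runs[i].size(), i));

    while (by_length.size() > 1) {
        std::size_t a = by_length.top().second; by_length.pop();
        std::size_t b = by_length.top().second; by_length.pop();
        std::vector<int> merged;
        merged.reserve(runs[a].size() + runs[b].size());
        std::merge(runs[a].begin(), runs[a].end(),
                   runs[b].begin(), runs[b].end(),
                   std::back_inserter(merged));
        runs[a] = std::move(merged);       // slot a now holds the merged run
        std::vector<int>().swap(runs[b]);  // release the memory of slot b
        by_length.push(entry(runs[a].size(), a));
    }
    return std::move(runs[by_length.top().second]);
}
```

The merge order is the same greedy order used to build a Huffman tree, so the large vectors tend to be copied fewer times than in a balanced tree.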

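If I had to guess, the fastest way would be to use the second idea, but with an N-ary merge instead of a pairwise merge, for some suitable N (which I'd guess is either a small constant or roughly the square root of the number of vectors), and to perform the N-ary merge with the first algorithm above, enumerating the contents of N vectors at once.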
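A rough sketch of that combination might look like the code below. It merges consecutive groups of at most n runs with a heap of (current, end) iterator pairs and repeats until one run is left. The names merge_group and merge_n_ary, the grouping of neighbours in input order, and the parameter n are all assumptions for illustration; the answer does not say how groups should be chosen, and no claim is made here about which n is fastest.

```cpp
#include <algorithm>
#include <cstddef>
#include <queue>
#include <utility>
#include <vector>

typedef std::vector<int>::const_iterator run_iter;
typedef std::pair<run_iter, run_iter> run_range;  // (current, end)

// Orders ranges so that the one with the smallest current value is on top.
struct greater_head {
    bool operator()(const run_range& a, const run_range& b) const {
        return *a.first > *b.first;
    }
};

// Heap-based merge of runs[first..last) into one sorted vector.
std::vector<int> merge_group(const std::vector<std::vector<int>>& runs,
                             std::size_t first, std::size_t last) {
    std::priority_queue<run_range, std::vector<run_range>, greater_head> heads;
    std::size_t total = 0;
    for (std::size_t i = first; i < last; ++i) {
        total += runs[i].size();
        if (!runs[i].empty())  // never push a range that cannot be dereferenced
            heads.push(run_range(runs[i].begin(), runs[i].end()));
    }
    std::vector<int> out;
    out.reserve(total);
    while (!heads.empty()) {
        run_range lead = heads.top();
        heads.pop();
        out.push_back(*lead.first);
        if (++lead.first != lead.second) heads.push(lead);
    }
    return out;
}

// Repeatedly merges groups of at most n neighbouring runs until one is left.
std::vector<int> merge_n_ary(std::vector<std::vector<int>> runs, std::size_t n) {
    if (n < 2) n = 2;  // a group of one would never shrink the list
    while (runs.size() > 1) {
        std::vector<std::vector<int>> next;
        for (std::size_t i = 0; i < runs.size(); i += n)
            next.push_back(merge_group(runs, i, std::min(i + n, runs.size())));
        runs = std::move(next);
    }
    return runs.empty() ? std::vector<int>() : std::move(runs.front());
}
```

With n at least as large as the number of vectors this degenerates into the plain heap merge; with n = 2 it behaves like the pairwise tree.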

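【Comments】:

Of course, for specialized data it's better to do a linear-time sort, e.g. a histogram sort, bucket sort or radix sort.
Thanks for the answer. I'm fairly new to this; could you provide some sample code for illustration? (1) How do you do an N-ary merge? (2) Could I see code for "a heap containing iterators into each vector, ordered by the value they currently point at (each entry also carrying the end iterator), where the current element is at the root of the heap and you advance by popping it or increasing its key (the latter by popping, incrementing, then pushing)"?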
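As an illustration of the linear-time suggestion in the first comment above, here is a small histogram (counting) sketch. It assumes that every value lies in a known, reasonably narrow range [lo, hi]; those bounds and the name histogram_merge are hypothetical and do not come from the thread. Note that this approach does not need the inputs to be sorted at all.

```cpp
#include <cstddef>
#include <vector>

// Assumes lo <= value <= hi for every value, and that hi - lo is small enough
// for a counts array to be affordable. Both bounds are hypothetical parameters.
std::vector<int> histogram_merge(const std::vector<std::vector<int>>& runs,
                                 int lo, int hi) {
    std::vector<std::size_t> counts(static_cast<std::size_t>(hi - lo) + 1, 0);
    std::size_t total = 0;
    for (const std::vector<int>& run : runs) {
        for (int value : run) ++counts[static_cast<std::size_t>(value - lo)];
        total += run.size();
    }
    std::vector<int> out;
    out.reserve(total);
    // Emit each value as many times as it was seen, in ascending order.
    for (std::size_t bucket = 0; bucket < counts.size(); ++bucket)
        out.insert(out.end(), counts[bucket], lo + static_cast<int>(bucket));
    return out;
}
```

The running time is proportional to the number of elements plus the width of the value range, so it only pays off when that range is not much larger than the data.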

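【Answer 2】:

One option is to use a std::priority_queue to maintain a heap of iterators, where the iterators bubble up the heap according to the values they point at.

You could also consider repeated applications of std::inplace_merge. This would involve appending all the data together into one big vector and remembering the offsets at which each distinct sorted block begins and ends, then passing those to inplace_merge. This will probably be faster than the heap solution, although I think the complexity is fundamentally the same.

Update: I've implemented the second algorithm I just described, repeatedly doing a merge sort in place. The code is on ideone.

This works by first concatenating all the sorted lists into one long list. If there were three source lists, this means there are four 'offsets': four points in the full list between which the elements are sorted. The algorithm then pulls off three of these at a time, merges the two corresponding adjacent sorted lists into one sorted list, and remembers two of those three offsets for use in new_offsets.

This repeats in a loop, merging pairs of adjacent sorted ranges together, until only one sorted range remains.

Ultimately, I think the best algorithm would involve merging the shortest pairs of adjacent ranges together first.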

// http://stackoverflow.com/questions/9013485/c-how-to-merge-sorted-vectors-into-a-sorted-vector-pop-the-least-element-fro/9048857#9048857
#include <iostream>
#include <vector>
#include <algorithm>
#include <cassert>
using namespace std;

template<typename T, size_t N>
vector<T> array_to_vector( T(*array)[N] ) { // Yes, this works. By passing in the *address* of
                                            // the array, all the type information, including the
                                            // length of the array, is known at compile time.
        vector<T> v( *array, &((*array)[N]));
        return v;
}   

void merge_sort_many_vectors() {

    /* Set up the Sorted Vectors */ 
    int a1[] = { 2, 10, 100};
    int a2[] = { 5, 15, 90, 200};
    int a3[] = { 12 };

    vector<int> v1  = array_to_vector(&a1);
    vector<int> v2  = array_to_vector(&a2);
    vector<int> v3  = array_to_vector(&a3);


    vector<int> full_vector;
    vector<size_t> offsets;
    offsets.push_back(0);

    full_vector.insert(full_vector.end(), v1.begin(), v1.end());
    offsets.push_back(full_vector.size());
    full_vector.insert(full_vector.end(), v2.begin(), v2.end());
    offsets.push_back(full_vector.size());
    full_vector.insert(full_vector.end(), v3.begin(), v3.end());
    offsets.push_back(full_vector.size());

    assert(full_vector.size() == v1.size() + v2.size() + v3.size());

    cout << "before:\t";
    for(vector<int>::const_iterator v = full_vector.begin(); v != full_vector.end(); ++v) {
            cout << ", " << *v;
    }       
    cout << endl;
    while(offsets.size()>2) {
            assert(offsets.back() == full_vector.size());
            assert(offsets.front() == 0);
            vector<size_t> new_offsets;
            size_t x = 0;
            while(x+2 < offsets.size()) {
                    // mergesort (offsets[x],offsets[x+1]) and (offsets[x+1],offsets[x+2])
                    // use iterator arithmetic: taking &full_vector[size()] would index past the end
                    inplace_merge(full_vector.begin() + offsets.at(x)
                                 ,full_vector.begin() + offsets.at(x+1)
                                 ,full_vector.begin() + offsets.at(x+2) // this *might* be end()
                                 );
                    // now they are sorted, we just put offsets[x] and offsets[x+2] into the new offsets.
                    // offsets[x+1] is not relevant any more
                    new_offsets.push_back(offsets.at(x));
                    new_offsets.push_back(offsets.at(x+2));
                    x += 2;
            }
            // if the number of ranges was odd (an even number of offsets), the last
            // range had no partner; remember its end offset in new_offsets
            if(x+2==offsets.size()) {
                    new_offsets.push_back(offsets.at(x+1));
            }
            // assert(new_offsets.front() == 0);
            assert(new_offsets.back() == full_vector.size());
            offsets.swap(new_offsets);

    }
    cout << "after: \t";
    for(vector<int>::const_iterator v = full_vector.begin(); v != full_vector.end(); ++v) {
            cout << ", " << *v;
    }
    cout << endl;
}

int main() {
        merge_sort_many_vectors();
}
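The closing remark of this answer, merging the shortest pair of adjacent ranges first, is not what the listing above does (it merges neighbours in fixed rounds). A minimal sketch of the shortest-pair variant, using the same offsets convention, could look like this. The function name is made up, and the linear scan for the best pair is a simplification that seems acceptable for roughly a hundred ranges.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// offsets holds the range boundaries in data: range i is
// [offsets[i], offsets[i + 1]). Expects offsets.front() == 0 and
// offsets.back() == data.size(), with each range already sorted.
void merge_shortest_adjacent(std::vector<int>& data,
                             std::vector<std::size_t> offsets) {
    while (offsets.size() > 2) {
        // Find the adjacent pair of ranges with the smallest combined length.
        std::size_t best = 0;
        for (std::size_t i = 1; i + 2 < offsets.size(); ++i)
            if (offsets[i + 2] - offsets[i] < offsets[best + 2] - offsets[best])
                best = i;
        std::inplace_merge(data.begin() + offsets[best],
                           data.begin() + offsets[best + 1],
                           data.begin() + offsets[best + 2]);
        // The boundary between the two merged ranges no longer exists.
        offsets.erase(offsets.begin() + (best + 1));
    }
}
```

For the sample data, the concatenated vector would be passed together with the offsets 0, 3, 7, 8, and the vector is sorted in place when the function returns.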

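【Comments】:

Thanks Aaron, I've implemented the first suggestion and posted the code. Any suggestions? If I have time to do the inplace_merge I'll update again.
@Deniz, your priority_queue algorithm looks good. I've now updated my answer here to include an implementation of my second algorithm, where pairs of adjacent sorted ranges are repeatedly merged together until only one range is left.
@AaronMcDaid I tried the above program with different input and the result is not in sorted order. Input: int a1[] = { 30, 50, 3, 8}; int a2[] = { 11, 14, 19, 6, 8, 30}; int a3[] = { 8, 6 }; Output: 11, 14, 19, 6, 8, 30, 30, 50, 3, 8, 6, 8
@SyncMaster, the question assumes the input vectors are already sorted, but none of the vectors you supplied is sorted. So I think my program is still correct for this question. If the goal were simply to merge a number of unsorted vectors, the solution would be to concatenate them and then run a standard std::sort on the result. The goal here, though, is to take advantage of the fact that the inputs are already sorted in order to get a faster sort.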
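The precondition discussed in the last two comments can be checked cheaply before choosing an approach. The sketch below, with made-up function names, uses std::is_sorted to verify the inputs and shows the concatenate-then-std::sort fallback mentioned for unsorted data.

```cpp
#include <algorithm>
#include <vector>

// Returns true when every input vector is in non-decreasing order, which is
// the precondition the merge-based approaches in this thread rely on.
bool all_sorted(const std::vector<std::vector<int>>& inputs) {
    for (const std::vector<int>& v : inputs)
        if (!std::is_sorted(v.begin(), v.end())) return false;
    return true;
}

// Fallback for unsorted inputs: concatenate everything, then std::sort.
std::vector<int> concat_and_sort(const std::vector<std::vector<int>>& inputs) {
    std::vector<int> out;
    for (const std::vector<int>& v : inputs)
        out.insert(out.end(), v.begin(), v.end());
    std::sort(out.begin(), out.end());
    return out;
}
```

For the input quoted in the comment, all_sorted reports false, so a caller would take the fallback path rather than any of the merges.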

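【Answer 3】:

I took the algorithm given here and abstracted it a little, converting it to templates. I wrote this version in VS2010 and used a lambda function instead of a functor. I don't know whether it is in any sense "better" than the previous versions, but perhaps it will be useful to someone.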

#include <queue>
#include <vector>

namespace priority_queue_sort
{
    using std::priority_queue;
    using std::pair;
    using std::make_pair;
    using std::vector;

    template<typename T>
    void value_vectors(const vector< vector <T> * >& input_sorted_vectors, vector<T> &output_vector)
    {
        typedef typename vector<T>::iterator iter;  // dependent type: "typename" is required outside VS2010's permissive mode
        typedef pair<iter, iter>    iter_pair;

        static auto greater_than_lambda = [](const iter_pair& p1, const iter_pair& p2) -> bool { return *(p1.first) >  *(p2.first); };

        priority_queue<iter_pair, std::vector<iter_pair>, decltype(greater_than_lambda) > cluster_feeder(greater_than_lambda);

        size_t total_size(0);

        for (auto k = input_sorted_vectors.begin(); k != input_sorted_vectors.end(); ++k)
        {
            cluster_feeder.push( make_pair( (*k)->begin(), (*k)->end() ) );
            total_size += (*k)->size();
        }

        output_vector.resize(total_size);
        total_size = 0;
        iter_pair c_lead;
        while (cluster_feeder.empty() != true)
        {
            c_lead = cluster_feeder.top();
            cluster_feeder.pop();
            output_vector[total_size++] = *(c_lead.first);
            c_lead.first++;
            if (c_lead.first != c_lead.second) cluster_feeder.push(c_lead);
        }
    }

    template<typename U, typename V>
    void pair_vectors(const vector< vector < pair<U, V> > * >& input_sorted_vectors, vector< pair<U, V> > &output_vector)
    {
        typedef typename vector< pair<U, V> >::iterator iter;  // dependent type: "typename" is required
        typedef pair<iter, iter> iter_pair;

        static auto greater_than_lambda = [](const iter_pair& p1, const iter_pair& p2) -> bool { return *(p1.first) >  *(p2.first); };

        priority_queue<iter_pair, std::vector<iter_pair>, decltype(greater_than_lambda) > cluster_feeder(greater_than_lambda);

        size_t total_size(0);

        for (auto k = input_sorted_vectors.begin(); k != input_sorted_vectors.end(); ++k)
        {
            cluster_feeder.push( make_pair( (*k)->begin(), (*k)->end() ) );
            total_size += (*k)->size();
        }

        output_vector.resize(total_size);
        total_size = 0;
        iter_pair c_lead;

        while (cluster_feeder.empty() != true)
        {
            c_lead = cluster_feeder.top();
            cluster_feeder.pop();
            output_vector[total_size++] = *(c_lead.first);  
            c_lead.first++;
            if (c_lead.first != c_lead.second) cluster_feeder.push(c_lead);
        }
    }
}

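The algorithm priority_queue_sort::value_vectors merges sorted vectors that contain plain values, whereas priority_queue_sort::pair_vectors merges sorted vectors that contain pairs of data, ordered by the first element of each pair (the lambda compares whole pairs with operator>, so ties on the first element fall back to the second). Hope someone finds it useful some day :-)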

【Comments】:

This has a bug when one of the input sorted vectors is empty. You could check for that before adding it to cluster_feeder.
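The check suggested in this comment might look like the sketch below. It is a reworked variant rather than the answer's exact code: the name merge_skipping_empty is invented, the inputs are taken as pointers to const vectors, and the result is returned instead of written to an output parameter.

```cpp
#include <cstddef>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical variant of value_vectors that ignores empty inputs.
template <typename T>
std::vector<T> merge_skipping_empty(const std::vector<const std::vector<T>*>& inputs) {
    typedef typename std::vector<T>::const_iterator iter;
    typedef std::pair<iter, iter> iter_pair;  // (current, end)
    auto greater_head = [](const iter_pair& a, const iter_pair& b) {
        return *a.first > *b.first;
    };
    std::priority_queue<iter_pair, std::vector<iter_pair>, decltype(greater_head)>
        feeder(greater_head);

    std::size_t total = 0;
    for (const std::vector<T>* v : inputs) {
        if (v->empty()) continue;  // begin() == end(): nothing to dereference
        feeder.push(iter_pair(v->begin(), v->end()));
        total += v->size();
    }
    std::vector<T> out;
    out.reserve(total);
    while (!feeder.empty()) {
        iter_pair lead = feeder.top();
        feeder.pop();
        out.push_back(*lead.first);
        if (++lead.first != lead.second) feeder.push(lead);
    }
    return out;
}
```

Without the empty() test, the comparator would dereference an end iterator as soon as a second entry is pushed, which is undefined behaviour.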