Answers: the intersection of multiple sorted arrays

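【Question title】: The intersection of multiple sorted arrays
【Posted】: 2014-10-19 18:45:35
【Question description】:

From this, we know a way to find the intersection of two sorted arrays. So how do we get the intersection of multiple sorted arrays?

Building on the answers for two sorted arrays, we can apply the same idea to multiple arrays. Here is the code: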

#include <algorithm> // std::max
#include <climits>   // INT_MIN
#include <vector>
using namespace std;

vector<int> intersectionVector(const vector<vector<int> >& vectors){
    const size_t vec_num = vectors.size();

    vector<int> inter_vec; // collection of intersection elements
    if (vec_num == 0){
        return inter_vec; // no input vectors, nothing to intersect
    }

    vector<size_t> vec_pos(vec_num, 0); // hold the current position for every vector

    while (true){
        int max_val = INT_MIN;
        for (size_t index = 0; index < vec_num; ++index){
            // reached the end of one array, return the intersection collection
            if (vec_pos[index] == vectors[index].size()){
                return inter_vec;
            }

            max_val = max(max_val, vectors[index].at(vec_pos[index]));
        }

        bool bsame = true;
        for (size_t index = 0; index < vec_num; ++index){
            // advance this vector while its element is less than the max value;
            // the bounds check keeps at() from throwing once a vector is exhausted
            while (vec_pos[index] < vectors[index].size() &&
                   vectors[index].at(vec_pos[index]) < max_val){
                vec_pos[index]++;
                bsame = false;
            }
        }

        // found the same element in all vectors
        if (bsame){
            inter_vec.push_back(vectors[0].at(vec_pos[0]));

            // advance the position of all vectors
            for (size_t index = 0; index < vec_num; ++index){
                vec_pos[index]++;
            }
        }
    }
}

Is there a better way to solve this?

Update 1

Judging from these two threads, 1 and 2, a hash set seems to be a more efficient approach.
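
As a rough illustration of the hash set idea, here is a minimal sketch. The name intersection_hash is made up for this example and the code is not taken from the linked threads. It keeps only distinct values, so unlike the merge-based code above it reports a value once even if every array contains it several times. Whether it is actually faster on sorted input would have to be measured.

```cpp
#include <algorithm>
#include <cstddef>
#include <unordered_set>
#include <vector>

// Hypothetical helper: intersect any number of int arrays through hash sets.
std::vector<int> intersection_hash(const std::vector<std::vector<int>>& vecs) {
    if (vecs.empty()) return {};

    // Start with the distinct values of the first array.
    std::unordered_set<int> common(vecs[0].begin(), vecs[0].end());

    // Keep only the values that also occur in each following array.
    for (std::size_t i = 1; i < vecs.size() && !common.empty(); ++i) {
        std::unordered_set<int> next;
        for (int value : vecs[i]) {
            if (common.count(value)) next.insert(value);
        }
        common.swap(next);
    }

    // Hash sets are unordered, so sort to give the caller a sorted result.
    std::vector<int> result(common.begin(), common.end());
    std::sort(result.begin(), result.end());
    return result;
}
```

Note that this version does not rely on the inputs being sorted at all, which is also why it cannot exploit that property.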

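Update 2

To improve performance, perhaps a min-heap could be used in the code above in place of the linear scans over vec_pos. The variable max_val holds the current maximum across all vectors, so we only need to compare the root value with max_val: if they are equal, every vector currently points at the same element, and it can be added to the intersection list.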
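
One possible reading of the min-heap idea, as a minimal sketch: the heap stores the current element of every vector together with the index of its vector, and max_val tracks the largest current element. The name intersection_heap and the rebuild helper are assumptions made for this example, not part of the code above.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical helper: intersect sorted vectors using a min-heap of current elements.
std::vector<int> intersection_heap(const std::vector<std::vector<int>>& vecs) {
    using Entry = std::pair<int, std::size_t>;  // (current value, index of its vector)
    std::vector<int> result;
    if (vecs.empty()) return result;

    std::vector<std::size_t> pos(vecs.size(), 0);
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
    int max_val = 0;

    // (Re)build the heap from the current positions; false if any vector is exhausted.
    auto rebuild = [&]() -> bool {
        heap = decltype(heap)();
        for (std::size_t i = 0; i < vecs.size(); ++i) {
            if (pos[i] == vecs[i].size()) return false;
            const int v = vecs[i][pos[i]];
            max_val = heap.empty() ? v : std::max(max_val, v);
            heap.push({v, i});
        }
        return true;
    };
    if (!rebuild()) return result;

    while (true) {
        const Entry smallest = heap.top();
        if (smallest.first == max_val) {
            // Root equals the maximum, so all vectors point at the same value.
            result.push_back(max_val);
            for (std::size_t& p : pos) ++p;
            if (!rebuild()) return result;
        } else {
            // Only the vector holding the smallest value has to move forward.
            heap.pop();
            const std::size_t i = smallest.second;
            if (++pos[i] == vecs[i].size()) return result;
            const int v = vecs[i][pos[i]];
            max_val = std::max(max_val, v);
            heap.push({v, i});
        }
    }
}
```

Each non-matching step costs O(log k) for k vectors instead of a scan over all of them. Rebuilding the heap after a match is the simplest choice here, not necessarily the fastest one.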

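【Question discussion】:

Tags: c++ arrays algorithm sorting

【Solution 1】:

To get the intersection of two sorted ranges, you can use std::set_intersection. Applied repeatedly, it handles any number of sorted vectors: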

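#include <algorithm>  // std::set_intersection
#include <iterator>   // std::back_inserter
#include <utility>    // std::swap
#include <vector>

std::vector<int> intersection (const std::vector<std::vector<int>> &vecs) {

    // nothing to intersect; also avoids reading vecs[0] of an empty input
    if (vecs.empty()) return {};

    auto last_intersection = vecs[0];
    std::vector<int> curr_intersection;

    for (std::size_t i = 1; i < vecs.size(); ++i) {
        // intersect the running result with the next sorted vector
        std::set_intersection(last_intersection.begin(), last_intersection.end(),
            vecs[i].begin(), vecs[i].end(),
            std::back_inserter(curr_intersection));
        std::swap(last_intersection, curr_intersection);
        curr_intersection.clear();
    }
    return last_intersection;
}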

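This looks a lot cleaner than your solution, which is too confusing to check for correctness. It also has optimal complexity.

The standard library algorithm set_intersection may be implemented in any way that uses

at most 2·(N1+N2-1) comparisons, where N1 = std::distance(first1, last1) and N2 = std::distance(first2, last2).

first1 etc. are the iterators defining the input ranges. If your standard library is open source (such as libstdc++ or libc++), you can look at the actual implementation in its source code.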
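
To make the comparison bound concrete, here is a minimal sketch of the textbook two-pointer merge that fits it. This is not the source of libstdc++ or libc++, and intersect_two is a made-up name. Every loop iteration makes at most two comparisons and advances at least one iterator.

```cpp
#include <vector>

// Hypothetical helper: intersect two sorted vectors with a two-pointer merge.
std::vector<int> intersect_two(const std::vector<int>& a, const std::vector<int>& b) {
    std::vector<int> out;
    auto i = a.begin();
    auto j = b.begin();
    while (i != a.end() && j != b.end()) {
        if (*i < *j) {
            ++i;                // *i cannot occur in the rest of b
        } else if (*j < *i) {
            ++j;                // *j cannot occur in the rest of a
        } else {
            out.push_back(*i);  // equal: part of the intersection
            ++i;
            ++j;
        }
    }
    return out;
}
```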

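【Discussion】:

Nice answer, @Baum mit Augen. I wonder whether the algorithm used inside set_intersection is the same as the one I mentioned?
@zangw Do make sure this is efficient enough for your problem. It iterates over each container twice instead of once (even the first one, which it copies), and once is what an "optimal" solution would do. So there is only a 2x slowdown there. It also allocates the size of the first container even if we need far less data than that; on the other hand, it allocates nothing proportional to the number of containers.
Did either of you actually measure the difference, or are you just assuming? (I am not saying my solution is the fastest, but even Alexandrescu says he needs to measure everything. Abandoning the simplest solution for performance reasons without actual data always looks like premature optimization to me.)
@tmyklebu, is there an example I can refer to?
@BaummitAugen: Yes, I have implemented both the algorithm you describe and galloping search. Galloping search wins easily on quite a few workloads because it does not examine every item in the lists. My main point is that, while the algorithm you describe is simple and your implementation is elegant, set intersection is a rather special problem: we can get an intersection algorithm that is linear in the worst case and has essentially sublinear behaviour on "typical" datasets in many domains.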
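
For readers wondering what galloping search looks like, here is a minimal sketch for two sorted vectors. It is one common way to write it, not the commenter's implementation, and the names gallop and intersect_galloping are made up. The search doubles its step until it overshoots the target and then finishes with a binary search, so long runs of non-matching items are skipped without being examined one by one.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper: first position at or after `from` whose value is >= target.
std::size_t gallop(const std::vector<int>& v, std::size_t from, int target) {
    std::size_t step = 1;
    std::size_t lo = from;  // everything in [from, lo) is known to be < target
    std::size_t hi = from;
    // Double the step until we reach a value >= target or run off the end.
    while (hi < v.size() && v[hi] < target) {
        lo = hi + 1;
        hi += step;
        step *= 2;
    }
    hi = std::min(hi, v.size());
    // Binary search inside the bracket [lo, hi).
    auto it = std::lower_bound(v.begin() + lo, v.begin() + hi, target);
    return static_cast<std::size_t>(it - v.begin());
}

// Hypothetical helper: walk the shorter vector and gallop through the longer one.
std::vector<int> intersect_galloping(const std::vector<int>& a, const std::vector<int>& b) {
    const std::vector<int>& shorter = a.size() <= b.size() ? a : b;
    const std::vector<int>& longer  = a.size() <= b.size() ? b : a;
    std::vector<int> out;
    std::size_t pos = 0;
    for (int value : shorter) {
        pos = gallop(longer, pos, value);
        if (pos == longer.size()) break;  // nothing left that could match
        if (longer[pos] == value) {
            out.push_back(value);
            ++pos;  // consume the match so duplicates pair up one-to-one
        }
    }
    return out;
}
```

For more than two arrays, one option is to fold this pairwise, starting with the shortest array so the running result stays small.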

【Solution 2】:

This assumes you know at compile time how many containers you want to intersect:

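#include <iterator> // std::begin, std::end
#include <tuple>    // std::make_tuple, std::get

// at_end, all_same, advance_all and advance_least are helpers still to be written
template<class Output, class... Cs>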
Output intersect( Output out, Cs const&... cs ) {
  using std::begin; using std::end;
  auto its = std::make_tuple( begin(cs)... );
  const auto ends = std::make_tuple( end(cs)... );
  while( !at_end( its, ends ) ) {
    if ( all_same( its ) ) {
      *out++ = *std::get<0>(its);
      advance_all( its );
    } else {
      advance_least( its );
    }
  }
  return out;
}

It just remains to implement:

template<class... Iterators>
bool at_end( std::tuple<Iterators...> const& its, std::tuple<Iterators...> const& ends );
template<class... Iterators>
bool all_same( std::tuple<Iterators...> const& its );
template<class... Iterators>
void advance_all( std::tuple<Iterators...>& its );
template<class... Iterators>
void advance_least( std::tuple<Iterators...>& its );

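The first is easy (use the indexes trick, compare pairwise, and check that it returns true if the tuples are empty).

The second is similar. I think it will be easier if you compare *std::get<i>(its) == *std::get<i+1>(its) rather than comparing everything to the zeroth element. The empty case may need special handling.

advance_all is even easier.

The last one is the tricky one. The requirements are that you advance at least one iterator, that you do not advance the iterator(s) that dereference to the largest value, that you advance each iterator at most once, and that you do all of this as efficiently as possible.

I suppose the easiest method is to find the greatest element and advance, by one, every iterator whose element is less than that.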
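
Here is one way the four helpers could be filled in, as a minimal sketch rather than the answerer's actual code. It assumes C++17 (fold expressions), at least one container, and element types that can be compared with == and < and passed to std::max together. The *_impl functions are names introduced for this example. all_same compares against the zeroth element for brevity, and advance_least uses the "find the greatest, advance everything smaller" approach described above.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <tuple>
#include <utility>
#include <vector>

// True as soon as any iterator has reached the end of its container.
template<class... Its, std::size_t... I>
bool at_end_impl(const std::tuple<Its...>& its, const std::tuple<Its...>& ends,
                 std::index_sequence<I...>) {
    return ((std::get<I>(its) == std::get<I>(ends)) || ...);
}
template<class... Its>
bool at_end(const std::tuple<Its...>& its, const std::tuple<Its...>& ends) {
    return at_end_impl(its, ends, std::index_sequence_for<Its...>{});
}

// True if every iterator points at a value equal to the zeroth one.
template<class... Its, std::size_t... I>
bool all_same_impl(const std::tuple<Its...>& its, std::index_sequence<I...>) {
    return ((*std::get<I>(its) == *std::get<0>(its)) && ...);
}
template<class... Its>
bool all_same(const std::tuple<Its...>& its) {
    return all_same_impl(its, std::index_sequence_for<Its...>{});
}

// Move every iterator forward by one.
template<class... Its, std::size_t... I>
void advance_all_impl(std::tuple<Its...>& its, std::index_sequence<I...>) {
    (++std::get<I>(its), ...);
}
template<class... Its>
void advance_all(std::tuple<Its...>& its) {
    advance_all_impl(its, std::index_sequence_for<Its...>{});
}

// Find the greatest current element, then advance each smaller iterator once.
template<class... Its, std::size_t... I>
void advance_least_impl(std::tuple<Its...>& its, std::index_sequence<I...>) {
    const auto greatest = std::max({*std::get<I>(its)...});
    auto step = [&](auto& it) { if (*it < greatest) ++it; };
    (step(std::get<I>(its)), ...);
}
template<class... Its>
void advance_least(std::tuple<Its...>& its) {
    advance_least_impl(its, std::index_sequence_for<Its...>{});
}

// The driver from the answer, repeated so the sketch compiles on its own.
template<class Output, class... Cs>
Output intersect(Output out, Cs const&... cs) {
    using std::begin; using std::end;
    auto its = std::make_tuple(begin(cs)...);
    const auto ends = std::make_tuple(end(cs)...);
    while (!at_end(its, ends)) {
        if (all_same(its)) {
            *out++ = *std::get<0>(its);
            advance_all(its);
        } else {
            advance_least(its);
        }
    }
    return out;
}
```

A call would look like intersect(std::back_inserter(result), a, b, c) for three sorted std::vector<int> objects a, b and c.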

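If you do not know the number of containers you are intersecting, the above can be refactored to use dynamic storage for the iterators. That would look similar to your own solution, except with the details factored out into sub-functions.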
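
A minimal sketch of that dynamic variant follows. The Range struct and the name intersect_dynamic are assumptions made for this example. Each sorted array is represented by a pair of iterators stored in a std::vector, and the same four sub-functions operate on that vector.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical type: the unread part of one sorted array.
struct Range {
    std::vector<int>::const_iterator cur;
    std::vector<int>::const_iterator end;
};

// True if any range is exhausted; also true for an empty list so the loop stops.
bool at_end(const std::vector<Range>& rs) {
    return rs.empty() ||
           std::any_of(rs.begin(), rs.end(),
                       [](const Range& r) { return r.cur == r.end; });
}

// True if every range currently points at the same value.
bool all_same(const std::vector<Range>& rs) {
    return std::all_of(rs.begin(), rs.end(),
                       [&](const Range& r) { return *r.cur == *rs.front().cur; });
}

void advance_all(std::vector<Range>& rs) {
    for (Range& r : rs) ++r.cur;
}

// Find the greatest current element, then advance each smaller range once.
void advance_least(std::vector<Range>& rs) {
    int greatest = *rs.front().cur;
    for (const Range& r : rs) greatest = std::max(greatest, *r.cur);
    for (Range& r : rs) {
        if (*r.cur < greatest) ++r.cur;
    }
}

std::vector<int> intersect_dynamic(const std::vector<std::vector<int>>& vecs) {
    std::vector<Range> rs;
    for (const std::vector<int>& v : vecs) rs.push_back({v.begin(), v.end()});

    std::vector<int> out;
    while (!at_end(rs)) {
        if (all_same(rs)) {
            out.push_back(*rs.front().cur);
            advance_all(rs);
        } else {
            advance_least(rs);
        }
    }
    return out;
}
```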

【Discussion】: