C++: Is there a faster way to get the intersection of map/unordered_map?

【Question title】: C++: Is there a faster way to get the intersection of map/unordered_map?
【Posted】: 2020-08-11 10:37:51
【Question】:

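Is there a faster way to do the following in C++, so that I can outperform my Python implementation?

Get the intersection of the keys of two map/unordered_map objects.
For these intersecting keys, compute the pairwise differences between the elements of their respective set/unordered_set.

Some information that might be useful:

hash_DICT1 has about O(10000) keys, with about O(10) elements in each set.
hash_DICT2 has about O(1000) keys, with about O(1) elements in each set.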

For example:

    map <int,set<int>> hash_DICT1;
        hash_DICT1[1] = {1,2,3};
        hash_DICT1[2] = {4,5,6};
    map <int,set<int>> hash_DICT2;
        hash_DICT2[1] = {11,12,13};
        hash_DICT2[3] = {4,5,6};

    vector<int> output_vector
        = GetPairDiff(hash_DICT1, hash_DICT2)
        = [11-1,12-1,13-1, 
           11-2,12-2,13-2, 
           11-3,12-3,13-3] // only hashkey=1 is intersect, so only compute pairwise difference of the respective set elements.
        = [10, 11, 12, 
            9, 10, 11, 
            8,  9, 10] // Note that i do want to keep duplicates, if any. Order does not matter.

The GetPairDiff function:

    vector<int> GetPairDiff(
    unordered_map <int, set<int>> &hash_DICT1,
    unordered_map <int, set<int>> &hash_DICT2) {
      // Init
        vector<int> output_vector;
        int curr_key;
        set<int> curr_set1, curr_set2;

      // Get intersection
        for (const auto &KEY_SET:hash_DICT2) {
          curr_key = KEY_SET.first;
          // Find pairwise difference
          if (hash_DICT1.count(curr_key) > 0){
            curr_set1 = hash_DICT1[curr_key];
            curr_set2 = hash_DICT2[curr_key];
            for (auto it1=curr_set1.begin(); it1 != curr_set1.end(); ++it1) {
              for (auto it2=curr_set2.begin(); it2 != curr_set2.end(); ++it2) {
                output_vector.push_back(*it2 - *it1);
              }
            }
          }
        }
        return output_vector;
    }

Main:

    int main (int argc, char ** argv) {
        // Using unordered_map
        unordered_map <int,set<int>> hash_DICT_1;
            hash_DICT_1[1] = {1,2,3};
            hash_DICT_1[2] = {4,5,6};
        unordered_map <int,set<int>> hash_DICT_2;
            hash_DICT_2[1] = {11,12,13};
            hash_DICT_2[3] = {4,5,6};
        GetPairDiff(hash_DICT_1, hash_DICT_2);
    }

Compiled like this:

g++ -o ./CompareRunTime.out -Ofast -Wall -Wextra -std=c++11

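Other data structures, such as map or unordered_set, are welcome. I did try all 4 permutations, though, and found that the one shown in GetPairDiff is the fastest, but it is still nowhere near as fast as the Python implementation: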

hash_DICT1 = { 1 : {1,2,3},      2 : {4,5,6} }
hash_DICT2 = { 1 : {11,12,13},   3 : {4,5,6} }

def GetPairDiff(hash_DICT1, hash_DICT2):
    vector = []
    for element in hash_DICT1.keys() & hash_DICT2.keys():
        vector.extend(
            [db_t-qry_t 
            for qry_t in hash_DICT2[element] 
            for db_t in hash_DICT1[element] ])
    return vector

output_vector = GetPairDiff(hash_DICT1, hash_DICT2)

Performance comparison:

python  : 0.00824 s
c++     : 0.04286 s

The C++ implementation takes about 5 times as long!

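【Comments】:

@TedLyngmo Thanks for your comment, I have updated the question accordingly. May I know where exactly you would use const&? Also, how do I replace the use of count with find?
Sure, I made an answer to show it.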

Tags: python c++ dictionary set intersection

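【Solution 1】:

You make a lot of copies where you should be using const&.
You don't save the result of the search. You should use find instead of count and then use the result.
push_back to a vector can be made faster by calling reserve() for the number of elements you need to store, if you know that number in advance.

Fixing these issues could result in something like the following (requires C++17):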

#include <iostream>
#include <unordered_map>
#include <unordered_set>
#include <utility>
#include <vector>

using container = std::unordered_map<int, std::unordered_set<int>>;

std::vector<int> GetPairDiff(const container& hash_DICT1,
                             const container& hash_DICT2) {
    // Init
    std::vector<int> output_vector;

    // Get intersection
    for(auto& [curr_key2, curr_set2] : hash_DICT2) {
        // use find() instead of count()
        if(auto it1 = hash_DICT1.find(curr_key2); it1 != hash_DICT1.end()) {
            auto& curr_set1 = it1->second;

            // Reserve the space you know you'll need for this iteration. Note:
            // This might be a pessimizing optimization so try with and without it.
            output_vector.reserve(curr_set1.size() * curr_set2.size() +
                                  output_vector.size());

            // Calculate pairwise difference
            for(auto& s1v : curr_set1) {
                for(auto& s2v : curr_set2) {
                    output_vector.emplace_back(s2v - s1v);
                }
            }
        }
    }
    return output_vector;
}

int main() {
    container hash_DICT1{{1, {1, 2, 3}}, 
                         {2, {4, 5, 6}}};
    container hash_DICT2{{1, {11, 12, 13}},
                         {3, {4, 5, 6}}};

    auto result = GetPairDiff(hash_DICT1, hash_DICT2);

    for(int v : result) {
        std::cout << v << '\n';
    }
}

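Compiled with g++ -std=c++17 -O3, this is more than 8 times faster on my computer than the Python version for these containers.

Here is a C++11 version of the same program: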

#include <iostream>
#include <unordered_map>
#include <unordered_set>
#include <utility>
#include <vector>

using container = std::unordered_map<int, std::unordered_set<int>>;

std::vector<int> GetPairDiff(const container& hash_DICT1,
                             const container& hash_DICT2) {
    // Init
    std::vector<int> output_vector;

    // Get intersection
    for(auto& curr_pair2 : hash_DICT2) {
        auto& curr_key2 = curr_pair2.first;
        auto& curr_set2 = curr_pair2.second;
        // use find() instead of count()
        auto it1 = hash_DICT1.find(curr_key2);
        if(it1 != hash_DICT1.end()) {
            auto& curr_set1 = it1->second;

            // Reserve the space you know you'll need for this iteration. Note:
            // This might be a pessimizing optimization so try with and without it.
            output_vector.reserve(curr_set1.size() * curr_set2.size() +
                                  output_vector.size());

            // Calculate pairwise difference
            for(auto& s1v : curr_set1) {
                for(auto& s2v : curr_set2) {
                    output_vector.emplace_back(s2v - s1v);
                }
            }
        }
    }
    return output_vector;
}

int main() {
    container hash_DICT1{{1, {1, 2, 3}}, 
                         {2, {4, 5, 6}}};
    container hash_DICT2{{1, {11, 12, 13}},
                         {3, {4, 5, 6}}};

    auto result = GetPairDiff(hash_DICT1, hash_DICT2);

    for(int v : result) {
        std::cout << v << '\n';
    }
}

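【Discussion】:

I tried your function, compiled with g++ -std=c++17 -O3, but for the line for(const auto& [curr_key2, curr_set2] : hash_DICT2) { I get the following errors: error: expected unqualified-id before ‘[’ token; error: expected ‘;’ before ‘[’ token; error: ‘curr_key2’ was not declared in this scope
@leonardltk1 OK, I suspect your g++ --version is somewhere between 5.1 and 6.4, which supports C++17 but apparently not structured bindings. I added a C++11 version at the bottom of the answer.
Thanks, it works now! Is there any difference in speed between the C++17 solution and the C++11 solution?
@leonardltk1 I doubt it. I think the maturity of the compiler affects this more than the language version does. By the way, I just ran a performance test with 10000 keys/10 values per set and 1000 keys/1 value per set, with 100 keys in the intersection (just a guess). quick-bench.com/q/Zhy6QIL_4ASlSIb2ab1mfo6-h8c shows that the reserve I suggested really hurts performance. It is better to do one reserve at the start, or to remove the reserve entirely.
Haha yes, I did try it. It jumped from 0.001 s to 6 s.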