Retain Duplicates with Set Intersection in CUDA

【Question title】: Retain Duplicates with Set Intersection in CUDA
【Posted】: 2015-08-19 02:25:17
【Question】:

I am using CUDA and Thrust to perform paired set operations. However, I would like to retain duplicates. For example:

int keys[7] = {1, 1, 1, 3, 4, 5, 5};
int vals[7] = {1, 2, 3, 4, 5, 6, 7};
int comp[2] = {1, 5};
int rk[7], rv[7];  // output buffers for the result keys and values

thrust::set_intersection_by_key(keys, keys + 7, comp, comp + 2, vals, rk, rv);

Desired result

rk[1, 1, 1, 5, 5]
rv[1, 2, 3, 6, 7]

Actual result

rk[1, 5]
rv[5, 7]

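I want all of the vals whose corresponding key is contained in comp.

Is there a way to achieve this with Thrust, or do I have to write my own kernel or Thrust function?

I am using this function: set_intersection_by_key.

【Comments】: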

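Yes, sorry about that. I will update the post - documentation
Set intersection does retain duplicates, but only as many as are duplicated in both sets. In your example there are no such duplicates. It sounds like you do not want a set intersection at all, although I am not sure what you would call the operation you describe.
Rather than an intersection, this looks like a filtering operation, where you build the result vector based on a predicate (the predicate being whether the key is contained in comp).
What I am trying to accomplish is similar to a SQL inner join: a filter that excludes any value whose key is not contained in the second set. Intersection was the closest thing I could think of. Maybe a filter function would be more appropriate, but I do not see anything like that in the version of Thrust shipped with CUDA 7.0.

Tags: cuda thrust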

【Solution 1】:

Quoting the Thrust documentation:

The generalization is that if an element appears m times in [keys_first1, keys_last1) and n times in [keys_first2, keys_last2) (where m may be zero), then it appears min(m,n) times in the keys output range.

Since comp contains each key only once, n=1 and therefore min(m,1) = 1.
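
The min(m,n) rule can be checked on the host with the C++ standard library, whose std::set_intersection follows the same multiplicity rule. The following is only a minimal sketch and not Thrust code; the helper name intersect_sorted is hypothetical and exists just to show why a comp that lists each key once yields each key once.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <vector>

// Hypothetical helper (not part of Thrust): intersects two sorted ranges.
// An element occurring m times in `a` and n times in `b` occurs min(m, n)
// times in the result.
std::vector<int> intersect_sorted(const std::vector<int>& a,
                                  const std::vector<int>& b)
{
    // Both inputs must be sorted, as with the Thrust set operations.
    assert(std::is_sorted(a.begin(), a.end()));
    assert(std::is_sorted(b.begin(), b.end()));

    std::vector<int> out;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::back_inserter(out));
    return out;
}
```

With keys {1, 1, 1, 3, 4, 5, 5} and comp {1, 5} this returns {1, 5}. All duplicates would survive only if comp itself were {1, 1, 1, 5, 5}.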

To get "all of the vals whose corresponding key is contained in comp", you can apply the approach of a related answer.

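As in that answer, the example code performs the following steps:

Get the largest element of d_comp. This assumes that d_comp is already sorted.
Create the vector d_map of size largest_element+1. Copy a 1 to every position of d_map that is given by an entry of d_comp.

Copy every entry of d_vals whose key has a 1 entry in d_map into d_result.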

#include <thrust/device_vector.h>
#include <thrust/iterator/constant_iterator.h>
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/functional.h>
#include <thrust/copy.h>
#include <thrust/scatter.h>
#include <iostream>
#include <iterator>  // std::ostream_iterator


#define PRINTER(name) print(#name, (name))
void print(const char* name, const thrust::device_vector<int>& v)
{
    std::cout << name << ":\t";
    thrust::copy(v.begin(), v.end(), std::ostream_iterator<int>(std::cout, "\t"));
    std::cout << std::endl;
}

int main()
{
    int keys[] = {1, 1, 1, 3, 4, 5, 5};
    int vals[] = {1, 2, 3, 4, 5, 6, 7};
    int comp[] = {1, 5};

    const int size_data = sizeof(keys)/sizeof(keys[0]);
    const int size_comp = sizeof(comp)/sizeof(comp[0]);

    // copy data to GPU
    thrust::device_vector<int> d_keys (keys, keys+size_data);
    thrust::device_vector<int> d_vals (vals, vals+size_data);
    thrust::device_vector<int> d_comp (comp, comp+size_comp);

    PRINTER(d_keys);
    PRINTER(d_vals);
    PRINTER(d_comp);

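    // d_comp is assumed to be sorted, so its last element is the largest key
    int largest_element = d_comp.back();

    // lookup table, zero-initialized; d_map[k] becomes 1 if key k is in d_comp
    thrust::device_vector<int> d_map(largest_element+1);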

    thrust::constant_iterator<int> one(1);
    thrust::scatter(one, one+size_comp, d_comp.begin(), d_map.begin());
    PRINTER(d_map);

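    thrust::device_vector<int> d_result(size_data);
    using namespace thrust::placeholders;
    // keep d_vals[i] if d_map[d_keys[i]] is non-zero; this assumes every key
    // lies in [0, largest_element], otherwise the lookup reads out of bounds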
    int final_size = thrust::copy_if(d_vals.begin(),
                                    d_vals.end(),
                                    thrust::make_permutation_iterator(d_map.begin(), d_keys.begin()),
                                    d_result.begin(),
                                    _1
                                    ) - d_result.begin();
    d_result.resize(final_size);

    PRINTER(d_result);

    return 0;
}

Output:

d_keys:     1   1   1   3   4   5   5   
d_vals:     1   2   3   4   5   6   7   
d_comp:     1   5   
d_map:      0   1   0   0   0   1   
d_result:   1   2   3   6   7

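【Discussion】:

This is great and really helped me understand how to use some of the functionality in Thrust, but I think there may be a caveat here regarding memory size. I expect the keys to number in the millions. My maximum value is currently 2757476, which would require about 10 MB of memory for the map. Could the map be changed to a char type instead?
No, 10 MB is fine, but I try to keep scaling issues in mind, and memory on these cards is at a premium; I could see this growing by 10x. With char, however, the size of the map would be 1/4 of comp. If comp gets that big, my working memory probably will not be my biggest problem. Still, this solution is great and exactly what I needed. I may work out the details later and then post my code and benchmarks. I will also try a kernel that checks every possible combination of the two sets. No memory overhead, but it will cost performance.
@Robear If you want to experiment with Thrust, you could also call thrust::binary_search from within a custom functor; this would also avoid any memory overhead, at the cost of computational overhead.
@Robear You could post a new question that includes a minimal reproducible example.
@Robear If you have a specific question about performance, you should ask a new question with a complete example of all your implementations, including the dataset you used, the timing code, and so on. It should be reproducible by copy and paste.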
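
The binary-search suggestion from the comments can be sketched on the host as follows. This is an illustration with the C++ standard library rather than Thrust, under the assumption that comp is sorted; the function name filter_by_key is hypothetical. Each key is looked up in comp in logarithmic time, so no lookup table sized by the largest key is needed.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: keeps vals[i] whenever keys[i] occurs in the sorted
// vector comp. No map proportional to the largest key is allocated.
std::vector<int> filter_by_key(const std::vector<int>& keys,
                               const std::vector<int>& vals,
                               const std::vector<int>& comp)
{
    assert(keys.size() == vals.size());
    assert(std::is_sorted(comp.begin(), comp.end()));

    std::vector<int> result;
    for (std::size_t i = 0; i < keys.size(); ++i) {
        // O(log |comp|) membership test replaces the d_map lookup.
        if (std::binary_search(comp.begin(), comp.end(), keys[i])) {
            result.push_back(vals[i]);
        }
    }
    return result;
}
```

For the data in the question, filter_by_key({1, 1, 1, 3, 4, 5, 5}, {1, 2, 3, 4, 5, 6, 7}, {1, 5}) yields {1, 2, 3, 6, 7}, the same values as d_result in the answer. A device-side version could presumably move the lookup into a functor used with thrust::copy_if, as the comment suggests; that part is not shown here.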