Writing a Partial Sum GPU Kernel

【Question title】: Writing a Partial Sum GPU Kernel
【Posted】: 2018-12-14 13:35:25
【Question】:

I have an array like the one below, mostly zeros with a 1 here and there. It is a huge vector, megabytes in size:

[0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 ..]

I need to store these 1s in an index for processing, so I need a kernel that produces this:

[0 0 0 0 0 0 0 1 1 1 1 1 2 2 2 2 2 ..]

How can I parallelize such an operation?

【Question comments】:

"(parallel) prefix sum" is the term you want to search for.

Tags: opencl gpu

【Solution 1】:

You are looking for a "parallel inclusive scan". The Thrust library (shipped with the CUDA Toolkit) provides one out of the box:

#include <thrust/scan.h>
#include <thrust/device_vector.h>
#include <iostream>

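int main( int argc, char * argv[] )
{
    // Sample input from the question, copied to the device.
    int data[17] = {0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0 };
    thrust::device_vector< int > in( data, data + 17 );
    thrust::device_vector< int > out( in.size() );

    // Inclusive scan on the device: out[i] = in[0] + ... + in[i].
    thrust::inclusive_scan( in.begin(), in.end(), out.begin() );

    // Reading out[i] on the host copies that element back from the device.
    for ( int i = 0; i < out.size(); ++i )
        std::cout << out[i] << " ";
    std::cout << std::endl;
}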

Output:

0 0 0 0 0 0 0 1 1 1 1 1 2 2 2 2 2
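
An inclusive scan is the same operation as a sequential running sum, so the result can be cross-checked on the host without a GPU. Below is a minimal sketch using `std::partial_sum` from the C++ standard library; the helper name `running_sum` is made up for illustration and is not part of Thrust.

```cpp
#include <numeric>
#include <vector>

// Sequential inclusive prefix sum: out[i] = in[0] + ... + in[i].
// Host-side reference only; nothing here runs on the GPU.
std::vector<int> running_sum(const std::vector<int>& in)
{
    std::vector<int> out(in.size());
    std::partial_sum(in.begin(), in.end(), out.begin());
    return out;
}
```

Calling `running_sum` on the 17-element sample from the question should give the same sequence as the Thrust output above, which makes it a convenient reference when testing a hand-written kernel.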

Alternatively, you can write the kernel explicitly. It would just be a variant of the parallel prefix sum algorithm, which Thrust generalizes nicely.
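
To give an idea of what such a kernel computes, here is a CPU simulation of one well-known variant, the Hillis-Steele scan. This is a sketch, not Thrust's implementation and not GPU code: the function name `hillis_steele_scan` is hypothetical, the inner loop stands in for the work-items that would run in parallel, and each pass of the outer loop stands in for one kernel launch (or one barrier-separated step).

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// CPU simulation of a Hillis-Steele inclusive scan.
// Inner loop body = what a single GPU work-item would do for index i.
// Outer loop      = one pass per stride (1, 2, 4, ...), about log2(n) passes.
std::vector<int> hillis_steele_scan(std::vector<int> src)
{
    std::vector<int> dst(src.size());
    for (std::size_t stride = 1; stride < src.size(); stride *= 2) {
        for (std::size_t i = 0; i < src.size(); ++i) {
            // Elements closer than `stride` to the start are already final.
            dst[i] = (i >= stride) ? src[i] + src[i - stride] : src[i];
        }
        // Double buffering: every pass reads the old values and writes new ones.
        std::swap(src, dst);
    }
    return src;
}
```

This variant is easy to map onto a kernel because each pass is fully data-parallel, but it performs O(n log n) additions in total. For a vector that is megabytes in size, a block-wise, work-efficient scheme is usually the better choice, which is one reason to prefer a library scan where one is available.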

【Comments】:

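I need it for OpenCL. Is there a good function or library for that?
Yes, there is "Bolt", which is almost identical to Thrust: hsa-libraries.github.io/Bolt/html/group__CL-scan.html
Oh, great. I currently have my data as a cl::buffer. Is there a code sample that demonstrates how to use this API? In particular the inclusive or exclusive scan functions.
cl::buffer to bolt::cl::device_vector to bolt::cl::inclusive_scan. For the first part, see hsa-libraries.github.io/Bolt/html/….
The library seems to be dead. The last update was in 2014.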