Fill device array consecutively in CUDA: answers

【Question title】: Fill device array consecutively in CUDA
【Posted】: 2018-03-25 21:16:00
【Question】:

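(This may be more of a theoretical parallel-optimization question than a CUDA-specific one. I'm very new to parallel programming, so this may just be ignorance on my part.)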

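I have a workload consisting of 64-bit binary numbers on which I run an analysis. If the analysis completes successfully, the number is a "valid solution". If the analysis breaks off midway, the number is "invalid". The end goal is a list of all valid solutions.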

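I am analyzing many trillions of 64-bit binary numbers, but only about 5% or fewer are valid solutions, and they usually come in bunches (for example, 1000 consecutive numbers are all valid, followed by a billion or so that are invalid). I can't find a pattern in the gaps between bunches, so I can't skip the large runs of invalid numbers.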

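Currently, every thread in a kernel call analyzes just one number. If the number is valid, the thread marks it as such at its corresponding position in a device array; likewise if it is invalid. So I generate one data point for every value analyzed, whether or not it is valid. Once the array is full, I copy it to the host only if a valid solution was found (indicated by a flag on the device). With this scheme, overall throughput is highest when the array is the same size as the number of threads in the grid.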

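But copying memory to and from the GPU is expensive in terms of time. What I would like to do is copy data only when necessary: fill a device array with valid solutions only, and copy it to the host once the array is full. But how do you fill an array consecutively in a parallel environment? Or am I approaching this problem the wrong way?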

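EDIT 1

This is the kernel I developed initially. As you can see, I generate 1 byte of data for each value analyzed. What I really need is just each valid 64-bit number; I can write a new kernel if necessary. As some commenters suggested, I am currently looking into stream compaction.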

__global__ void kValid(unsigned long long *kInfo, unsigned char *values, char *solutionFound) {
    // A 64-bit binary value to be evaluated is called a kValue.
    unsigned long long int kStart, kEnd, kRoot, kSize, curK;
    // kRoot is the kValue at the start of the device array; it is used if the device array is larger than the total number of threads in the grid.
    // kStart is the kValue to start this kernel call on.
    // kEnd is the last kValue to validate.
    // kSize is how many bits long a kValue is (we don't necessarily use all 64 bits, but this value stays constant over the entire chunk of values defined on the host).
    // curK is the current kValue represented as a 64-bit unsigned integer.

    int rowCount, kBitLocation, kMirrorBitLocation, row, col, nodes, edges; 

    kStart = kInfo[0];
    kEnd = kInfo[1];
    kRoot = kInfo[2];
    nodes = kInfo[3];
    edges = kInfo[4];
    kSize = kInfo[5];

    curK = blockIdx.x*blockDim.x + threadIdx.x + kStart;
    if (curK > kEnd) {//check to make sure you don't overshoot the end value
        return;
    }

    kBitLocation = 1;//assuming the first bit in the kvalue has a position 1;
    for (row = 0; row < nodes; row++) {
        rowCount = 0;
        kMirrorBitLocation = row;//the bit position for the mirrored kValues always starts at the row value (assuming the first row has a position of 0)
        for (col = 0; col < nodes; col++) {
            if (col > row) {
                if (curK & (1ULL << (kSize - kBitLocation))) {//1ULL, because a plain int 1 cannot be shifted by 32 or more bits
                    rowCount++;
                }
                kBitLocation++;
            }
            if (col < row) {
                if (col > 0) {
                    kMirrorBitLocation += (nodes - 2) - (col - 1);
                }
                if (curK & (1ULL << (kSize - kMirrorBitLocation))) {//if bit is set (1ULL keeps the shift 64 bits wide)
                    rowCount++;
                }
            }
        }
        if (rowCount != edges) {
            //mark this kValue as invalid (one byte per value, not one bit)
            values[curK - kRoot] = 0;
            return;
        }
    }
    //mark this kValue as valid (one byte per value, not one bit)
    values[curK - kRoot] = 1;
    *solutionFound = 1; //not a race condition b/c it will only ever be set to 1 by any thread.
}

【Question comments】:

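Does your output need to be ordered? That is, if your 64-bit numbers are 6, 3, 17, 10 and only 3 and 10 are valid, must the final output be 3 followed by 10, or may it be 10 followed by 3? Also, do you need the positions of the valid values in the input, or just the values themselves?
Do you have some sample CUDA code that you are starting from? That way we can give a more specific, code-based answer.
Regarding your last paragraph, what you describe is called stream compaction: data from the contributing threads is consolidated into one buffer for transfer to the host. Libraries such as thrust and cub provide ways to do this. this blog also covers the idea, and here is a related question.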
@einpoklum No, it does not need to be ordered; I only need the actual 64-bit values.
@Milhous I will update my question with the source code.
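
As a rough illustration of the stream-compaction idea raised in the comments, here is a minimal sketch using Thrust's `copy_if`. The `IsValid` predicate is a hypothetical stand-in for the real analysis (it simply accepts multiples of 20, about 5% of inputs), and `compactValid` is an illustrative helper name, not something from the question. The output buffer is sized for the worst case for simplicity; a real run would presumably size it from the expected hit rate.

```cuda
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/iterator/counting_iterator.h>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the real analysis: accepts multiples of 20.
struct IsValid {
    __host__ __device__ bool operator()(unsigned long long k) const {
        return (k % 20ULL) == 0ULL;
    }
};

// Tests the values kStart .. kStart + count - 1 on the device and returns
// only the valid ones, packed consecutively.
std::vector<unsigned long long> compactValid(unsigned long long kStart,
                                             unsigned long long count) {
    // The inputs are generated on the fly; no input array is stored.
    thrust::counting_iterator<unsigned long long> first(kStart);

    // Worst-case sized output buffer (an assumption made for simplicity).
    thrust::device_vector<unsigned long long> out(count);

    // copy_if writes the accepted values contiguously and returns an
    // iterator one past the last value written.
    auto last = thrust::copy_if(thrust::device, first, first + count,
                                out.begin(), IsValid());
    std::size_t found = static_cast<std::size_t>(last - out.begin());

    // Only the compacted values cross the PCI Express bus.
    std::vector<unsigned long long> host(found);
    thrust::copy(out.begin(), out.begin() + found, host.begin());
    return host;
}
```

Calling `compactValid(0, 100)` under these assumptions yields the five values 0, 20, 40, 60, 80. `copy_if` is stable, so the output happens to be ordered even though the asker does not require that.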

Tags: parallel-processing cuda

【Solution 1】:

(This answer assumes that output order is immaterial, and so are the positions of the valid values.)

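Conceptually, your analysis produces a set of valid values. The implementation you describe uses a dense representation of that set: one flag for every potential value (one byte each in the posted kernel). Yet you have indicated that the data is quite sparse (a valid fraction of either 5e-2, or 1000/10^9 = 1e-6); moreover, copying data over PCI Express is quite painful.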

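So why not consider a sparse representation? The simplest would be just an unordered sequence of the valid values. Of course, writing that requires some synchronization across threads, perhaps even across blocks. Roughly, you could have warps collect their valid values in shared memory; then synchronize at block level to collect the block's valid values (for the given chunk of input it has analyzed); and finally use atomics to collect the data from all blocks.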

Oh, and also: have each thread analyze multiple values, so you don't have to synchronize as often.
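
The collection scheme outlined in this answer might look roughly like the sketch below. It is a simplification, not the answerer's code: the warp-level stage is folded into a single shared-memory counter per block, and each block then reserves its output range with one global atomic. `isValid`, `collectValid`, and `runCollect` are hypothetical names, the predicate (multiples of 20) merely stands in for the real analysis, and CUDA error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical placeholder for the real per-value analysis.
__device__ bool isValid(unsigned long long k) { return (k % 20ULL) == 0ULL; }

__global__ void collectValid(unsigned long long kStart, unsigned long long kEnd,
                             unsigned long long *out, unsigned int capacity,
                             unsigned int *outCount) {
    __shared__ unsigned int blockCount;  // valid values found by this block
    __shared__ unsigned int blockBase;   // start of this block's range in out

    if (threadIdx.x == 0) blockCount = 0;
    __syncthreads();

    unsigned long long k =
        kStart + (unsigned long long)blockIdx.x * blockDim.x + threadIdx.x;
    bool valid = (k <= kEnd) && isValid(k);

    // Each valid thread takes a slot within the block (shared-memory atomic).
    unsigned int local = 0;
    if (valid) local = atomicAdd(&blockCount, 1u);
    __syncthreads();

    // One global atomic per block reserves a contiguous output range.
    if (threadIdx.x == 0) blockBase = atomicAdd(outCount, blockCount);
    __syncthreads();

    // Values are dropped if the buffer is full; the host clamps the count.
    if (valid && blockBase + local < capacity) out[blockBase + local] = k;
}

// Host helper: analyzes kStart..kEnd (inclusive) and returns the valid
// values in unspecified order.
std::vector<unsigned long long> runCollect(unsigned long long kStart,
                                           unsigned long long kEnd,
                                           unsigned int capacity) {
    unsigned long long *dOut = nullptr;
    unsigned int *dCount = nullptr;
    cudaMalloc(&dOut, capacity * sizeof(unsigned long long));
    cudaMalloc(&dCount, sizeof(unsigned int));
    cudaMemset(dCount, 0, sizeof(unsigned int));

    const unsigned int threads = 256;
    unsigned long long n = kEnd - kStart + 1;
    unsigned int blocks = (unsigned int)((n + threads - 1) / threads);
    collectValid<<<blocks, threads>>>(kStart, kEnd, dOut, capacity, dCount);

    unsigned int found = 0;
    cudaMemcpy(&found, dCount, sizeof(found), cudaMemcpyDeviceToHost);
    if (found > capacity) found = capacity;

    std::vector<unsigned long long> host(found);
    if (found > 0) {
        cudaMemcpy(host.data(), dOut, found * sizeof(unsigned long long),
                   cudaMemcpyDeviceToHost);
    }
    cudaFree(dOut);
    cudaFree(dCount);
    return host;
}
```

The design choice here is to pay for cheap shared-memory atomics per hit and only one global atomic per block, which keeps contention on `outCount` low. The output is packed but unordered, which matches the assumption stated at the top of this answer.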

【Comments】:

【Solution 2】:

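You may want to have each thread analyze multiple numbers (thousands or millions) before returning from the computation. If a thread analyzes a million numbers, you only need about 5% of that amount of space to hold the results of the computation.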
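
One way to have each thread handle many numbers is a grid-stride loop, sketched below under stated assumptions. The predicate `isValidSketch` (multiples of 20, so exactly 5% of inputs) is a hypothetical stand-in for the real analysis, `scanRange` and `runScan` are illustrative names, the fixed launch shape of 64 blocks of 256 threads is arbitrary, and error checking is omitted. Each hit takes a slot with a global `atomicAdd`; with a hit rate of about 5% these atomics should be comparatively rare.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical placeholder for the real analysis: 5% of inputs are valid.
__device__ bool isValidSketch(unsigned long long k) {
    return (k % 20ULL) == 0ULL;
}

// Each thread walks the range with a grid-wide stride, so a fixed-size grid
// covers an arbitrarily large count of values.
__global__ void scanRange(unsigned long long kStart, unsigned long long count,
                          unsigned long long *out, unsigned int capacity,
                          unsigned int *outCount) {
    unsigned long long stride = (unsigned long long)gridDim.x * blockDim.x;
    unsigned long long i =
        (unsigned long long)blockIdx.x * blockDim.x + threadIdx.x;
    for (; i < count; i += stride) {
        unsigned long long k = kStart + i;
        if (isValidSketch(k)) {
            // Reserve the next free slot; drop the value if the buffer is full.
            unsigned int slot = atomicAdd(outCount, 1u);
            if (slot < capacity) out[slot] = k;
        }
    }
}

// Host helper: analyzes count values starting at kStart and returns the
// valid ones in unspecified order.
std::vector<unsigned long long> runScan(unsigned long long kStart,
                                        unsigned long long count,
                                        unsigned int capacity) {
    unsigned long long *dOut = nullptr;
    unsigned int *dCount = nullptr;
    cudaMalloc(&dOut, capacity * sizeof(unsigned long long));
    cudaMalloc(&dCount, sizeof(unsigned int));
    cudaMemset(dCount, 0, sizeof(unsigned int));

    scanRange<<<64, 256>>>(kStart, count, dOut, capacity, dCount);

    unsigned int found = 0;
    cudaMemcpy(&found, dCount, sizeof(found), cudaMemcpyDeviceToHost);
    if (found > capacity) found = capacity;

    std::vector<unsigned long long> host(found);
    if (found > 0) {
        cudaMemcpy(host.data(), dOut, found * sizeof(unsigned long long),
                   cudaMemcpyDeviceToHost);
    }
    cudaFree(dOut);
    cudaFree(dCount);
    return host;
}
```

With this placeholder predicate, `runScan(0, 1000000, 50000)` analyzes a million values and fills a buffer of 50,000 entries, the 5% figure mentioned above. With real data the hit rate is not known in advance, so the capacity check (or a larger buffer) matters.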

【Comments】: