GPU 内核中的指针和位运算符答案

【问题标题】：Pointers and bit operators in GPU kernelsGPU 内核中的指针和位运算符
【发布时间】：2013-05-23 02:35:01
【问题描述】：

我想使用 GPU 内核对卷执行双重阈值。我将每个切片的音量作为只读 image2d_t 发送。我的输出体积是二进制体积，其中 每个位 指定其相关体素是启用还是禁用。我的内核检查当前像素值是否在下/上限阈值范围内，并在二进制卷中启用其对应位。

出于调试目的，我暂时将实际检查留下了注释。我只是使用传递的切片 nr 来确定二进制音量位是否应该打开或关闭。前 14 个切片设置为“开”，其余设置为“关”。我还在 CPU 端验证了这段代码，我粘贴在这篇文章底部的代码。代码显示了两条路径，现在对 CPU 进行了注释。

CPU 代码按预期工作，应用二进制掩码渲染体积后返回以下图像：

使用我的 GPU 内核运行完全相同的逻辑会返回不正确的结果（第一个 3D，第二个切片视图）：

这里出了什么问题？我读到 OpenCL 不支持位字段，但据我从 OpenCL 规范中了解，它确实支持位运算符。我的位逻辑从 32 位字中选择正确的位并翻转它，是否受支持？还是我的简单标志被认为是位字段。它的作用是从左侧选择体素%32 位（不是右侧，因此是减法）。

另一件事可能是传递给我的内核的 uint 指针与我的预期不同。我认为这将是对指针的有效使用并将数据传递给我的内核。应用于内核中“uint* word”部分的逻辑是由于每行填充单词和每片填充行。 CPU变种确认指针计算逻辑是有效的。

下面；代码

            uint wordsPerRow = (uint)BinaryVolumeWordsPerRow(volume.Geometry.NumberOfVoxels);
            uint wordsPerPlane = (uint)BinaryVolumeWordsPerPlane(volume.Geometry.NumberOfVoxels);

            int[] dims = new int[3];
            dims[0] = volume.Geometry.NumberOfVoxels.X;
            dims[1] = volume.Geometry.NumberOfVoxels.Y;
            dims[2] = volume.Geometry.NumberOfVoxels.Z;

            uint[] arrC = dstVolume.BinaryData.ObtainArray() as uint[];
            unsafe {
                fixed(int* dimPtr = dims) {
                    fixed(uint *arrcPtr = arrC) {
                        // pick Cloo Platform
                        ComputePlatform platform = ComputePlatform.Platforms[0];

                        // create context with all gpu devices
                        ComputeContext context = new ComputeContext(ComputeDeviceTypes.Gpu,
                            new ComputeContextPropertyList(platform), null, IntPtr.Zero);

                        // load opencl source
                        StreamReader streamReader = new StreamReader(@"C:\views\pii-sw113v1\PMX\ADE\Philips\PmsMip\Private\Viewing\Base\BinaryVolumes\kernels\kernel.cl");
                        string clSource = streamReader.ReadToEnd();
                        streamReader.Close();

                        // create program with opencl source
                        ComputeProgram program = new ComputeProgram(context, clSource);

                        // compile opencl source
                        program.Build(null, null, null, IntPtr.Zero);

                        // Create the event wait list. An event list is not really needed for this example but it is important to see how it works.
                        // Note that events (like everything else) consume OpenCL resources and creating a lot of them may slow down execution.
                        // For this reason their use should be avoided if possible.
                        ComputeEventList eventList = new ComputeEventList();

                        // Create the command queue. This is used to control kernel execution and manage read/write/copy operations.
                        ComputeCommandQueue commands = new ComputeCommandQueue(context, context.Devices[0], ComputeCommandQueueFlags.None);

                        // Create the kernel function and set its arguments.
                        ComputeKernel kernel = program.CreateKernel("LowerThreshold");

                        int slicenr = 0;
                        foreach (IntPtr ptr in pinnedSlices) {
                            /*// CPU VARIANT FOR TESTING PURPOSES 
                            for (int y = 0; y < dims[1]; y++) {
                                for (int x = 0; x < dims[0]; x++) {
                                    long pixelOffset = x + y * dims[0];
                                    ushort* ushortPtr = (ushort*)ptr;
                                    ushort pixel = *(ushortPtr + pixelOffset);

                                    int BinaryWordShift = 5;
                                    int BinaryWordBits = 32;
                                    if (
                                        (0 <= x) &&
                                        (0 <= y) &&
                                        (0 <= slicenr) &&
                                        (x < dims[0]) &&
                                        (y < dims[1]) &&
                                        (slicenr < dims[2])
                                    ) {
                                        uint* word =
                                            arrcPtr + 1 + (slicenr * wordsPerPlane) +
                                            (y * wordsPerRow) +
                                            (x >> BinaryWordShift);
                                        uint mask = (uint)(0x1 << ((BinaryWordBits - 1) - (byte)(x & 0x1f)));
                                        //if (pixel > lowerThreshold && pixel < upperThreshold) {
                                        if (slicenr < 15) {
                                            *word |= mask;
                                        } else {
                                            *word &= ~mask;
                                        }
                                    }
                                }
                            }*/

                            ComputeBuffer<int> dimsBuffer = new ComputeBuffer<int>(
                                context,
                                ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.CopyHostPointer,
                                3,
                                new IntPtr(dimPtr));

                            ComputeImageFormat format = new ComputeImageFormat(ComputeImageChannelOrder.Intensity, ComputeImageChannelType.UnsignedInt16);
                            ComputeImage2D image2D = new ComputeImage2D(
                                context, 
                                ComputeMemoryFlags.ReadOnly, 
                                format, 
                                volume.Geometry.NumberOfVoxels.X, 
                                volume.Geometry.NumberOfVoxels.Y, 
                                0, 
                                ptr
                            );

                            // The output buffer doesn't need any data from the host. Only its size is specified (arrC.Length).
                            ComputeBuffer<uint> c = new ComputeBuffer<uint>(
                                context, ComputeMemoryFlags.WriteOnly, arrC.Length);

                            kernel.SetMemoryArgument(0, image2D);
                            kernel.SetMemoryArgument(1, dimsBuffer);
                            kernel.SetValueArgument(2, wordsPerRow);
                            kernel.SetValueArgument(3, wordsPerPlane);
                            kernel.SetValueArgument(4, slicenr);
                            kernel.SetValueArgument(5, lowerThreshold);
                            kernel.SetValueArgument(6, upperThreshold);
                            kernel.SetMemoryArgument(7, c);

                            // Execute the kernel "count" times. After this call returns, "eventList" will contain an event associated with this command.
                            // If eventList == null or typeof(eventList) == ReadOnlyCollection<ComputeEventBase>, a new event will not be created.
                            commands.Execute(kernel, null, new long[] { dims[0], dims[1] }, null, eventList);

                            // Read back the results. If the command-queue has out-of-order execution enabled (default is off), ReadFromBuffer 
                            // will not execute until any previous events in eventList (in our case only eventList[0]) are marked as complete 
                            // by OpenCL. By default the command-queue will execute the commands in the same order as they are issued from the host.
                            // eventList will contain two events after this method returns.
                            commands.ReadFromBuffer(c, ref arrC, false, eventList);

                            // A blocking "ReadFromBuffer" (if 3rd argument is true) will wait for itself and any previous commands
                            // in the command queue or eventList to finish execution. Otherwise an explicit wait for all the opencl commands 
                            // to finish has to be issued before "arrC" can be used. 
                            // This explicit synchronization can be achieved in two ways:
                            // 1) Wait for the events in the list to finish,
                            //eventList.Wait();
                            //}
                            // 2) Or simply use
                            commands.Finish();

                            slicenr++;
                        }

                    }
                }
            }

还有我的内核代码：

const sampler_t smp = CLK_FILTER_NEAREST | CLK_ADDRESS_CLAMP |   CLK_NORMALIZED_COORDS_FALSE;
kernel void LowerThreshold(
    read_only image2d_t image,
    global int* brickSize,
    uint wordsPerRow,
    uint wordsPerPlane,
    int slicenr,
    int lower,
    int upper,
    global write_only uint* c )
{

    int4 coord = (int4)(get_global_id(0),get_global_id(1),slicenr,1);
    uint4 pixel = read_imageui(image, smp, coord.xy);

    uchar BinaryWordShift = 5;
    int BinaryWordBits = 32;
    if (
            (0 <= coord.x) &&
            (0 <= coord.y) &&
            (0 <= coord.z) &&
            (coord.x < brickSize[0]) &&
            (coord.y < brickSize[1]) &&
            (coord.z < brickSize[2])
    ) {
        global uint* word =
            c + 1 + (coord.z * wordsPerPlane) +
            (coord.y * wordsPerRow) +
            (coord.x >> BinaryWordShift);

        uint mask = (uint)(0x1 << ((BinaryWordBits - 1) - (uchar)(coord.x & 0x1f)));
        //if (pixel.w > lower && pixel.w < upper) {
        if (slicenr < 15) {
            *word |= mask;
        } else {
            *word &= ~mask;
        }
    }
}

【问题讨论】：

你给的内核是哪一个？ CPU 还是 GPU？
同一个词是由多个线程同时写入的？这会产生危险，可能使用原子操作或使用不同的更新位策略
内核代码是GPU内核。内核的 CPU 实现在其上面的代码部分中被注释掉。是的，同一个 *word（因此传入的 *uint c）是由多个线程编写的。我预计底层的 opencl 在全局缓存写入级别会有某种 lock n 等待结构，但事实并非如此？
使用 atomic_or 代替 |=， atomic_and 代替 &=。虽然这应该保证正确性（如果没有其他错误），但效率低下，因为每个单词都会发生争用。

标签： c# opencl gpu cloo

【解决方案1】：

两个问题：

您已将“c”声明为“write_only”，但使用了“|=”和“&=”运算符，它们是读-修改-写
正如其他发帖者所提到的，如果两个工作项正在访问同一个单词，则 read-modify-write 之间存在竞争条件，这将导致错误。原子操作比非原子操作慢得多，所以虽然可能，但不推荐。

我建议您将输出放大 8 倍并使用字节而不是位。这将使您的输出只写，并且还会消除争用，从而消除竞争条件。

或者（如果数据紧凑性或格式很重要）每个工作项一次处理 8 个元素，并将复合 8 位输出写入单个字节。这将是只写的，没有争用，并且仍然具有您的数据紧凑性。

【讨论】：

选择这个作为答案，因为最后一部分正是我最终实现的。确实需要紧凑性，既不破坏其他功能，又不破坏内存消耗。谢谢。