What are the right grid and block dimensions for 2D triangular smooth in CUDA? Answers

【Question title】: What are the right Grid and Block Dimensions for 2D triangular smooth in CUDA?
【Posted】: 2013-02-06 21:31:53
【Question】:

I have a sequential smoothing algorithm:

void triangularSmooth(unsigned char *grayImage, unsigned char *smoothImage, const int width, const int height, const float *filter, NSTimer &timer, dim3 grid_size, dim3 block_size) {
    for ( int y = 0; y < height; y++ ) {
        for ( int x = 0; x < width; x++ ) {
            unsigned int filterItem = 0;
            float filterSum = 0.0f;
            float smoothPix = 0.0f;

            for ( int fy = y - 2; fy < y + 3; fy++ ) {
                for ( int fx = x - 2; fx < x + 3; fx++ ) {
                    if ( ((fy < 0) || (fy >= height)) || ((fx < 0) || (fx >= width)) ) {
                        filterItem++;
                        continue;
                    }

                    smoothPix += grayImage[(fy * width) + fx] * filter[filterItem];
                    filterSum += filter[filterItem];
                    filterItem++;
                }
            }

            smoothPix /= filterSum;
            smoothImage[(y * width) + x] = static_cast< unsigned char >(smoothPix);
        }
    }
}

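I am implementing the same algorithm in CUDA and would like to use a shared-memory variable to hold the pixels of grayImage. Before doing that, however, I am trying to run it as is. For that I have this kernel code: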

__global__ void smooth(unsigned char *grayImage, unsigned char *smoothImage, const int width, const int height, const float *filter)
{

        int x = blockIdx.x*blockDim.x + threadIdx.x;
        int y = blockIdx.y*blockDim.y + threadIdx.y;

        unsigned int filterItem = 0;
        float filterSum = 0.0f;
        float smoothPix = 0.0f;

        for ( int fy = y - 2; fy < y + 3; fy++ ) {
            for ( int fx = x - 2; fx < x + 3; fx++ ) {
                if ( ((fy < 0) || (fy >= height)) || ((fx < 0) || (fx >= width)) ) {
                    filterItem++;
                    continue;
                }

                smoothPix += grayImage[(fy * width) + fx] * filter[filterItem];
                filterSum += filter[filterItem];
                filterItem++;
            }
        }
        smoothPix /= filterSum;
        smoothImage[(y * width) + x] = static_cast< unsigned char >(smoothPix);
}

and this launch code:

const float filter[] = {1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 2.0f, 2.0f, 2.0f, 1.0f, 1.0f, 2.0f, 3.0f, 2.0f, 1.0f, 1.0f, 2.0f, 2.0f, 2.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f};
dim3 gridSize((width*height)/1024,(width*height)/1024,1);
dim3 blockSize(256,256,1);
smooth <<< gridSize, blockSize >>> (grayImage, smoothImage, width, height, filter);
cudaDeviceSynchronize();

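The problem is that the resulting smoothed image looks as if the pixels all ended up in the wrong places (scrambled together). Is this caused by the grid and block dimensions? I have tried many other possible dimensions. What is the right way to do this?

I am using a GTX 480: compute capability 2.x, maximum dimensionality of the grid of thread blocks 3, maximum x-, y-, or z-dimension of a grid of thread blocks 65535, maximum number of threads per block 1024.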

【Comments on the question】:

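Your kernel never runs, because the block size is illegal. If you add some error checking to your code, you will see that the kernel launch fails with an invalid configuration error.
I have a cudaGetLastError(); right after cudaDeviceSynchronize(); and it does not return any error.
Please read this question and answer for the proper way to check for errors during a kernel launch. Also note that in your question you say your GPU has a limit of 1024 threads per block, yet you are requesting 256*256*1 threads per block...
The check was incorrect; you are right! However, I also get an error with (16,16,1) or (32,32,1).
So that this is not a complete waste of everyone's time, please write up your solution as an answer. You will be able to accept that answer later, which marks the question as answered.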

Tags: c++ cuda

【Solution 1】:

First of all, the dimensions were completely invalid. In this case, the following should work:

dim3 blockSize(16, 16, 1);
dim3 gridSize((width + blockSize.x - 1)/ blockSize.x, (height + blockSize.y - 1) / blockSize.y, 1);
smooth <<< gridSize, blockSize >>> (grayImage, smoothImage, width, height);
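
The arithmetic behind this launch configuration can be checked on the host. The following is a minimal sketch in plain C++ with no CUDA dependency. `LaunchDims`, `divUp`, `fitsBlockLimit` and `makeLaunchDims` are hypothetical helper names that stand in for `dim3` and the launch setup, and the 1024-thread limit is the figure quoted in the question for the GTX 480.

```cpp
// Hypothetical stand-in for the x/y parts of CUDA's dim3,
// so that this sketch compiles without the CUDA toolkit.
struct LaunchDims {
    unsigned int blockX, blockY;  // threads per block
    unsigned int gridX, gridY;    // blocks per grid
};

// Ceiling division: the smallest number of blocks of size `block`
// that covers `n` elements.
constexpr unsigned int divUp(unsigned int n, unsigned int block) {
    return (n + block - 1) / block;
}

// A block is legal only if its total thread count stays within the device limit.
bool fitsBlockLimit(unsigned int bx, unsigned int by, unsigned int maxThreadsPerBlock) {
    return static_cast<unsigned long long>(bx) * by <= maxThreadsPerBlock;
}

// One thread per pixel: the block size is chosen by the caller,
// the grid size is derived from the image size.
LaunchDims makeLaunchDims(unsigned int width, unsigned int height,
                          unsigned int bx, unsigned int by) {
    return LaunchDims{bx, by, divUp(width, bx), divUp(height, by)};
}
```

With a limit of 1024 threads per block, `fitsBlockLimit(256, 256, 1024)` is false (65536 threads), while 16x16 and 32x32 blocks pass. For a 1920x1080 image with 16x16 blocks, `makeLaunchDims` gives a 120x68 grid. The last row of blocks then covers 1088 rows, so 8 rows of threads fall outside the image and the kernel has to skip them.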

With that fixed, running under cuda-memcheck produced output similar to this:

========= Invalid __global__ read of size 4
=========     at 0x00000120 in cudaFilter
=========     by thread (4,1,0) in block (1,0,0)
=========     Address 0x05100190 is out of bounds

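This shows that something in the kernel code is reading out of bounds (most likely an array index). Checking the individual variables narrowed it down to filter[], which turned out to be empty on the device.

Finally, if filter[] is to be passed to the kernel, it should first be copied from the CPU to the GPU with something like the following, where filterGpu is a device buffer allocated beforehand (for example with cudaMalloc):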

cudaMemcpy(filterGpu, filter, 25 * sizeof(float), cudaMemcpyHostToDevice);

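Alternatively, if the filter is not needed anywhere else (as is the case here), it can be declared inside the kernel; the kernel then no longer needs a filter parameter.

【Comments】:

【Solution 2】:

Following this answer about image filtering, I recommend creating the block and grid for the image like this: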

dim3 blockSize(16,16,1);
dim3 gridSize((width + blockSize.x - 1)/blockSize.x,(height + blockSize.y - 1)/blockSize.y,1);

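Another very common mistake you are making is that the filter array you pass to the kernel is allocated on the host. Create an array of the same size on the device and copy the coefficients from the host to the device. Pass that device array to the kernel.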

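Also, it is highly recommended to compute the sum of the filter coefficients on the host and pass it to the kernel as an argument, instead of computing the sum over and over again in every thread.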
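
As a rough illustration of that suggestion, the total can be computed once on the host. This is a sketch in plain C++; `filterTotal` and `usesFullFilter` are hypothetical helper names, not part of the question's code. One caveat, which follows from the kernel in the question rather than from this answer: that kernel skips filter taps that fall outside the image and leaves their weights out of filterSum, so a precomputed total matches its result only for pixels that are at least 2 pixels away from every edge.

```cpp
#include <cstddef>

// Sum of all filter coefficients, computed once on the host.
float filterTotal(const float *filter, std::size_t count) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < count; ++i) {
        sum += filter[i];
    }
    return sum;
}

// True when the whole 5x5 window around (x, y) lies inside the image,
// i.e. when the precomputed total equals the per-pixel filterSum.
bool usesFullFilter(int x, int y, int width, int height) {
    return x >= 2 && x < width - 2 && y >= 2 && y < height - 2;
}
```

For the 25 coefficients in the question the total is 35. Border pixels, for which `usesFullFilter` is false, would still need the per-pixel sum if the original normalization is to be kept.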

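Boundary conditions may cause out-of-range memory accesses. Handle the boundary conditions explicitly in the kernel. Alternatively, the easy way is to use a CUDA texture for the input image, so that boundary conditions are handled automatically.

【Comments】: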