Performance of adding 2D arrays on GPU and CPU: answers

【Question title】: Performance of adding 2D arrays on GPU and CPU
【Posted】: 2013-09-02 00:51:57
【Question】:

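I am currently comparing the performance of OpenCL code on the GPU against C++ on the CPU. I have written programs that compute the sum z = x + y, where z, x, and y are two-dimensional arrays (matrices), for both the GPU and the CPU. After testing these programs, I found that the CPU computes this sum more efficiently than the GPU, because data transfer over the PCI bus between the GPU and the CPU is slow. Now I want to determine how many additions are needed before the GPU becomes more efficient than the CPU. I plan to do this by extending the sum z = x + y to z = x + y + y + y + y + ... and so on.

Is it possible to make the GPU more efficient than the CPU for this particular problem just by increasing the number of additions?

FYI: I am using an nVIDIA GeForce GT 640 graphics card and an Intel Core i5 CPU.

Any help would be greatly appreciated.

EDIT:

Below is my code for the CPU: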

#include <cstdlib>  // malloc, free

int main(int argc, const char * argv[])
{

    //This value determines the size of the nxn (square) array
    int n = 1000;

    //Allocating the row-pointer arrays for the nxn arrays of floats.
    float **x = (float**)malloc(sizeof(float*)*n);
    float **y = (float**)malloc(sizeof(float*)*n);
    float **z = (float**)malloc(sizeof(float*)*n);

    //Allocating each row and initializing x and y.
    for(int i = 0; i<n; i++){
        x[i] = (float*)malloc(sizeof(float)*n);
        y[i] = (float*)malloc(sizeof(float)*n);
        z[i] = (float*)malloc(sizeof(float)*n);

        for(int j = 0; j<n; j++){
            x[i][j] = (float)(i+j);
            y[i][j] = (float)(i+j);
        }
    }

    //Computing z = x + y, then adding y another 100 times.
    for(int i = 0; i<n; i++){
        for(int j = 0; j<n; j++){

            z[i][j] = x[i][j] + y[i][j];
            for(int k = 0; k < 100; k++){
                z[i][j] += y[i][j];
            }
        }
    }

    //Releasing the rows, then the row-pointer arrays.
    for(int i = 0; i<n; i++){
        free(x[i]);
        free(y[i]);
        free(z[i]);
    }
    free(x);
    free(y);
    free(z);

    return 0;

}

And here is the C++ that uses OpenCL (to copy the data and execute the kernel on the GPU):

int n = 1000;

for(int i = 0; i<n; i++)
    {
        //Writing the data from the host to the device
        err = clEnqueueWriteBuffer(queue, d_xx, CL_TRUE, 0, sizeof(float)*n, h_xx[i], 0, NULL, NULL);
        if(err != CL_SUCCESS){
            std::cout << "Error: Could not write to buffer d_xx" << std::endl;
            exit(1);
        }

        err = clEnqueueWriteBuffer(queue, d_yy, CL_TRUE, 0, sizeof(float)*n, h_yy[i], 0, NULL, NULL);
        if(err != CL_SUCCESS){
            std::cout << "Error: Could not write to buffer d_yy" << std::endl;
            exit(1);
        }

        //Setting the Kernel Arguments
        err = clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_xx);
        if(err != CL_SUCCESS){
            std::cout << "Error: Could not set kernel argument h_xx." << std::endl;
            exit(1);
        }

        err = clSetKernelArg(kernel, 1, sizeof(cl_mem), &d_yy);
        if(err != CL_SUCCESS){
            std::cout << "Error: Could not set kernel argument h_yy." << std::endl;
            exit(1);
        }

        err = clSetKernelArg(kernel, 2, sizeof(cl_mem), &d_zz);
        if(err != CL_SUCCESS){
            std::cout << "Error: Could not set kernel argument h_zz." << std::endl;
            exit(1);
        }

        work_units_per_kernel = n;

        //Executing the Kernel
        err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &work_units_per_kernel, NULL, 0, NULL, NULL);
        if(err != CL_SUCCESS){
            std::cout << "Error: Could not execute kernel." << std::endl;
            exit(1);
        }

        //Reading the Data from the Kernel
        err = clEnqueueReadBuffer(queue, d_zz, CL_TRUE, 0, n*(sizeof(float)), h_zz[i], 0, NULL, NULL);
        if(err != CL_SUCCESS){
            std::cout << "Error: Could not read data from kernel." << std::endl;
            exit(1);
        }

    }

Finally, the kernel code executed on the GPU:

__kernel void arraysum(__global const float *d_aa, __global const float *d_bb, __global float *d_cc)
{

    int i = get_global_id(0);

    d_cc[i] = d_aa[i] + d_bb[i];


    for(int j = 0; j < 100; j++){
        d_cc[i] += d_bb[i];
    }


}

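【Comments】:

Why don't you post your code here? If it is parallelized, the GPU will win. The GPU may also have faster memory. Are you parallelizing in your program? A single CPU thread will outperform a single GPU "thread".
@SigTerm Thanks for your reply. I have attached some snippets of my code. I hope they help clarify whether I am parallelizing in my program.
This looks to me like a classic case of too little computation, so the whole operation is memory-bound. Unless you can do more processing on the GPU, you lose more to the bus speed than you gain from the GPU itself.

Tags: c++ performance opencl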

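【Solution 1】:

For n = 1000*1000, you are already at the point where it pays to copy the data over, operate on it, and copy it back. As DarkZero pointed out, global memory is not optimal, so caching global memory in local or private (per-thread) memory and using local work groups will help a lot on both the CPU and the GPU.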
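A copy of the whole n*n matrix is only a single transfer if the host data is contiguous, which the per-row malloc layout in the question is not. The sketch below shows one possible contiguous layout. It is an assumption, not the asker's code: `Matrix` and `make_matrix` are hypothetical names. With such a layout, one call such as clEnqueueWriteBuffer with a size of sizeof(float)*n*n could replace the n per-row transfers, provided the device buffer is created with the same size.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: an n x n matrix stored row-major in one contiguous
// block, so data.data() can be passed to a single transfer call.
struct Matrix {
    std::size_t n;
    std::vector<float> data;  // n * n elements, row after row

    explicit Matrix(std::size_t size) : n(size), data(size * size, 0.0f) {}

    // Element (row, col) lives at flat index row * n + col.
    float& at(std::size_t row, std::size_t col) { return data[row * n + col]; }
    const float& at(std::size_t row, std::size_t col) const { return data[row * n + col]; }

    // Total size in bytes, i.e. the size argument of one whole-matrix copy.
    std::size_t bytes() const { return data.size() * sizeof(float); }
};

// Fills the matrix with the same values as the question's CPU code:
// element (i, j) = i + j.
Matrix make_matrix(std::size_t n) {
    Matrix m(n);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            m.at(i, j) = static_cast<float>(i + j);
        }
    }
    return m;
}
```

The flat index row * n + col matches the indexing get_global_id(1)*get_global_size(0) + get_global_id(0) used by a 2D kernel launch over the same data.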

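Let's start with the kernel. d_cc is referenced in global memory 100 times. A simple change in this case is to cache the global value in private (per-thread) memory and copy the private value back to global memory at the end.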

 __kernel void arraysum(__global const float *d_aa, __global const float *d_bb, __global float *d_cc)
{

     int i = get_global_id(0);

     float t_d_cc = d_aa[i] + d_bb[i]; //make a thread only version of d_cc

     for(int j = 0; j < 100; j++){
         t_d_cc += d_bb[i];
     }

     d_cc[i] = t_d_cc; //copy the thread only back to global
}
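To confirm that a kernel change like this leaves the result unchanged, the device output can be compared against a plain host reference. The following is a minimal sketch that is not part of the original answer: `reference_arraysum` and `extra_adds` are hypothetical names, and the inputs are assumed to be flat vectors of equal length.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical host-side reference mirroring the kernel above: each element
// is accumulated in a local variable and written to the output exactly once.
std::vector<float> reference_arraysum(const std::vector<float>& a,
                                      const std::vector<float>& b,
                                      int extra_adds = 100) {
    if (a.size() != b.size()) {
        throw std::invalid_argument("input sizes differ");
    }
    std::vector<float> c(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        float acc = a[i] + b[i];           // private accumulator
        for (int j = 0; j < extra_adds; ++j) {
            acc += b[i];
        }
        c[i] = acc;                        // single write per element
    }
    return c;
}
```

Comparing this vector element by element with the buffer read back from the device, using a small tolerance, is one way to check the kernel.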

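Another change, which depends on the hardware, is to cache d_aa and d_bb in local memory. This lets OpenCL take advantage of bulk copies from global memory. It can be more challenging, because each OpenCL device has a different maximum local work group size and a different multiple that the work group size must follow.

For example, my i5 has a maximum work group size of 1024 and a work group multiple of 1, so my local work group can be anything from 1 to 1024. My ATI-7970 has values of 256 and 64 respectively, so my local work group must be 64, 128, and so on. That is much more restrictive.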

__kernel void arraysum(__global const float *d_aa,
                       __local float *l_d_aa,
                       __global const float *d_bb,
                       __local float *l_d_bb,
                       __global float *d_cc,
                       __local float *l_d_cc)
{

//In this example, global_id(1) is the row and global_id(0) is the column,
//so when the kernel is called, the local work group size needs to be the
//number of columns

int i = get_global_id(1)*get_global_size(0) + get_global_id(0); //Flat index of this element
int lid = get_local_id(0); //Position within the work group (the column)

//Each work-item caches one element of each input in local memory
l_d_aa[lid] = d_aa[i];
l_d_bb[lid] = d_bb[i];

//Wait until every work-item in the group has filled its local slots
barrier(CLK_LOCAL_MEM_FENCE);

l_d_cc[lid] = l_d_aa[lid] + l_d_bb[lid];

//Adds every cached d_bb value of the row, read from local memory
for(int j = 0; j < get_local_size(0); j++){
    l_d_cc[lid] += l_d_bb[j];
}

d_cc[i] = l_d_cc[lid]; //copy the local result back to global

}

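I apologize if my algorithm is wrong, but hopefully it conveys how to cache global memory in local memory. Again, on the i5 the local work group size can be anything from 1 to 1024, but the ATI-7970 is limited to column counts of 64, 128, and so on.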
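When the problem size is not a multiple of the required work group multiple, a common workaround is to round the global size up and keep the extra work-items from touching memory. The helper below sketches only that arithmetic. `round_up_to_multiple` is a hypothetical name, and the kernel-side bounds check it implies is an assumption that does not appear in the kernels of this answer.

```cpp
#include <cstddef>
#include <stdexcept>

// Hypothetical helper: smallest multiple of `multiple` that is >= n.
// Intended for padding a global work size to a device's work group multiple.
std::size_t round_up_to_multiple(std::size_t n, std::size_t multiple) {
    if (multiple == 0) {
        throw std::invalid_argument("multiple must be positive");
    }
    return ((n + multiple - 1) / multiple) * multiple;
}
```

For the 1000-element rows in the question and a multiple of 64, this yields 1024, so the 24 padding work-items per row would have to skip their reads and writes.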

It is conceptually much harder, but OpenCL performs much better with this approach.

Community, feel free to clean up the kernel.

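【Comments】:

Using local memory does not make much sense here, because the accumulator is per work-item. The private memory kernel, however, should work fine. Kernel + non-blocking calls + a very high N = a very good speedup.
I think it is worth it if he can cache a whole row in local memory, which he can when using the CPU. It would be interesting to compare compiler-generated (gcc or MSVC) code against OpenCL-compiled code with caching on the CPU.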

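【Solution 2】:

Many things are slowing you down:

1- Abuse of global memory. Each global memory access is roughly 400 times slower, and you use nothing but global memory (around 200 reads/writes). Global memory should only be read at the start and written at the end, never used for intermediate values.

2- Your N is very small. The CPU finishes in about 1000 instructions, while every latency on the GPU is far larger than that. Copy operations carry a fixed overhead, which is why a 100 MB copy is far more efficient than a 1-byte copy.

3- The CPU code is probably being optimized by the compiler into a multiplication, while the GPU code cannot be, because it accesses variables that behave as if volatile, such as global memory.

4- Reads and writes to device memory are very expensive. If you include them in the measurement, the CPU will win easily. Creating OpenCL buffers and kernels is also very expensive. Note that you are also using blocking write calls, which are much slower than non-blocking ones.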
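Point 3 can be made concrete on the host. Mathematically, the loop in the question computes x + 101*y, so it can collapse into one multiply and one add. The sketch below compares both forms; `sum_by_loop` and `sum_by_multiply` are hypothetical names. Whether a compiler performs this rewrite on its own is an assumption: floating-point addition is not associative, so under strict IEEE semantics compilers generally need relaxed floating-point options to do it, and the two forms can differ in the last bits for values that are not exactly representable.

```cpp
// z = x + y, then y added `extra_adds` more times, as in the question.
float sum_by_loop(float x, float y, int extra_adds = 100) {
    float z = x + y;
    for (int k = 0; k < extra_adds; ++k) {
        z += y;
    }
    return z;
}

// Closed form of the same sum: x + (extra_adds + 1) * y.
float sum_by_multiply(float x, float y, int extra_adds = 100) {
    return x + static_cast<float>(extra_adds + 1) * y;
}
```

For the integer-valued inputs in the question (i + j stays below 2000), every intermediate sum is an integer below 2^24, so both forms produce identical float results. If the CPU benchmark is meant to measure 101 additions per element, it is worth checking the generated code or the compiler flags to see which form actually runs.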

【Comments】: