Unexpected CPU utilization with OpenCL: answers

【Question title】: Unexpected CPU utilization with OpenCL
【Posted】: 2018-01-30 15:51:55
【Question】:

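I wrote a simple OpenCL kernel that computes the cross-correlation of two images on the GPU. However, when I execute the kernel with enqueueNDRangeKernel, the usage of one CPU core rises to 100%, even though the host code does nothing except wait for the enqueued command to finish. Is this normal behavior for an OpenCL program? What is going on there?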

The OpenCL kernel (in case it is relevant):

kernel void cross_correlation(global double *f,
                              global double *g,
                              global double *res) {
  // This work item will compute the cross-correlation value for pixel w
  const int2 w = (int2)(get_global_id(0), get_global_id(1));

  // Main loop
  int xy_index = 0;
  int xy_plus_w_index = w.x + w.y * X;

  double integral = 0;
  for ( int y = 0; y + w.y < Y; ++y ) {
    for ( int x = 0; x + w.x < X; ++x, ++xy_index, ++xy_plus_w_index ) {
      // xy_index is equal to x + y * X
      // xy_plus_w_index is equal to (x + w.x) + (y + w.y) * X
      integral += f[xy_index] * g[xy_plus_w_index];
    }

    xy_index += w.x;
    xy_plus_w_index += w.x;
  }

  res[w.x + w.y * X] = integral;
}

The images f, g, and res are X by Y pixels in size, where X and Y are set at compile time. I am testing the kernel above with X = 2048 and Y = 2048.
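For checking the kernel's output on small images, a host-side reference can help. The following is a minimal sketch that mirrors the kernel's loops on the CPU; the function name `cross_correlation_reference` is hypothetical and not part of the original post, and the images are assumed to be stored row-major with width X, as in the kernel.

```cpp
#include <cstddef>
#include <vector>

// Host-side reference for the kernel above: for every offset (wx, wy),
// sum f(x, y) * g(x + wx, y + wy) over the region where both indices are valid.
// Images are assumed to be row-major with width X and height Y.
std::vector<double> cross_correlation_reference(const std::vector<double>& f,
                                                const std::vector<double>& g,
                                                std::size_t X, std::size_t Y) {
  std::vector<double> res(X * Y, 0.0);
  for (std::size_t wy = 0; wy < Y; ++wy) {
    for (std::size_t wx = 0; wx < X; ++wx) {
      double integral = 0.0;
      for (std::size_t y = 0; y + wy < Y; ++y) {
        for (std::size_t x = 0; x + wx < X; ++x) {
          integral += f[x + y * X] * g[(x + wx) + (y + wy) * X];
        }
      }
      res[wx + wy * X] = integral;
    }
  }
  return res;
}
```

This is only practical for small X and Y, since the work grows with the square of the pixel count; it is meant for comparing a few values against the GPU result, not for the 2048 by 2048 case.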

Additional information: I am running the kernel on an Nvidia GPU using OpenCL version 1.2. The C++ program is written with the OpenCL C++ Wrapper API and executed on Debian using optirun from the bumblebee package.

As requested, here is a minimal working example:

#include <CL/cl.hpp>

#include <fstream>
#include <sstream>
#include <string>

using namespace std;

int main ( int argc, char **argv ) {
  const int X = 2048;
  const int Y = 2048;

  // Create context
  cl::Context context ( CL_DEVICE_TYPE_GPU );

  // Read kernel from file
  ifstream kernel_file ( "cross_correlation.cl" );
  stringstream buffer;
  buffer << kernel_file.rdbuf ( );
  string kernel_code = buffer.str ( );

  // Build kernel
  cl::Program::Sources sources;
  sources.push_back ( { kernel_code.c_str ( ), kernel_code.length ( ) } );
  cl::Program program ( context, sources );
  program.build ( " -DX=2048 -DY=2048" );

  // Allocate buffer memory
  cl::Buffer fbuf ( context, CL_MEM_READ_WRITE, X * Y * sizeof(double) );
  cl::Buffer gbuf ( context, CL_MEM_READ_WRITE, X * Y * sizeof(double) );
  cl::Buffer resbuf ( context, CL_MEM_WRITE_ONLY, X * Y * sizeof(double) );

  // Create command queue
  cl::CommandQueue queue ( context );

  // Create kernel
  cl::Kernel kernel ( program, "cross_correlation" );

  kernel.setArg ( 0, fbuf );
  kernel.setArg ( 1, gbuf );
  kernel.setArg ( 2, resbuf );

  // Set input arguments
  double *f = new double[X*Y];
  double *g = new double[X*Y];

  for ( int i = 0; i < X * Y; i++ )
    f[i] = g[i] = 0.001 * i;

  queue.enqueueWriteBuffer ( fbuf, CL_TRUE, 0, X * Y * sizeof(double), f );
  queue.enqueueWriteBuffer ( gbuf, CL_TRUE, 0, X * Y * sizeof(double), g );

  // Execute kernel
  queue.enqueueNDRangeKernel ( kernel, cl::NullRange, cl::NDRange ( X, Y ), cl::NullRange, NULL, NULL );
  queue.finish ( );

  return 0;
}

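【Comments】:

How do you wait for the kernel to finish? Please post a minimal reproducible example.
By calling queue.finish ( );
It looks like your device's implementation is performing a spin-wait loop inside clFinish.
That may be true, although it seems to me an odd way to implement clFinish.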
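If the driver really does spin inside clFinish, as the comment above suggests, one possible workaround is to poll the kernel's completion event and sleep between polls instead of calling finish. The helper below is a minimal sketch of that idea. `wait_with_sleep` and its default interval are hypothetical, not part of OpenCL, and whether this lowers CPU usage on a particular driver is an assumption that would have to be measured.

```cpp
#include <chrono>
#include <thread>

// Hypothetical helper: evaluates is_done() repeatedly, sleeping between
// checks so the waiting thread gives up the CPU instead of spinning.
// Returns the number of times is_done() was evaluated.
template <typename Predicate>
unsigned wait_with_sleep(Predicate is_done,
                         std::chrono::microseconds interval =
                             std::chrono::microseconds(500)) {
  unsigned polls = 1;
  while (!is_done()) {
    std::this_thread::sleep_for(interval);
    ++polls;
  }
  return polls;
}

// Possible use with the OpenCL C++ wrapper (a sketch, not tested here):
//
//   cl::Event done;
//   queue.enqueueNDRangeKernel(kernel, cl::NullRange, cl::NDRange(X, Y),
//                              cl::NullRange, NULL, &done);
//   queue.flush();  // submit the command; querying status does not flush
//   wait_with_sleep([&] {
//     // CL_COMPLETE is 0; negative values indicate an error, so stop on both.
//     return done.getInfo<CL_EVENT_COMMAND_EXECUTION_STATUS>() <= CL_COMPLETE;
//   });
```

The trade-off is latency: completion is noticed up to one sleep interval late, so the interval should be small relative to the expected kernel run time.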

Tags: c++ linux opencl

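【Solution 1】:

You don't say how you call enqueueNDRangeKernel, which is the critical bit. As I understand it, on NVidia the call is blocking (although I don't think the standard requires it to be). You can work around this by having a separate thread invoke enqueueNDRangeKernel and block on it while your other threads continue; the blocking thread can then signal an event when the call completes.

There is a discussion of it here, which raises some caveats about making multiple calls to the queue in parallel.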

【Discussion】:

I tested it, and the call to enqueueNDRangeKernel does not block, so that does not seem to be the problem. I also call this function only once.