OpenCL CL_INVALID_COMMAND_QUEUE when launching a kernel on both NVIDIA and Intel GPUs: answers

【Question title】: OpenCL CL_INVALID_COMMAND_QUEUE when launching Kernel at Both NVIDIA and Intel GPUs
【Posted】: 2019-02-19 15:55:57
【Question】:

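This may not be the most narrowly scoped question, but here goes.

The program implements a wrapper around everything OpenCL. The wrapper detects all OpenCL devices and then wraps each of them in a further wrapper. A device wrapper holds every object related to its device, such as the allocated cl_mem buffers and the associated context.

I have checked several times for mistakes such as reused pointers, for example a bug that would make device wrappers from different platforms share the same platform pointer. I found none.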

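The problem: when I distribute work across all compute devices on my laptop (CPU + Intel GPU + NVIDIA GPU), kernel execution sent to the NVIDIA GPU crashes with CL_INVALID_COMMAND_QUEUE.

I have checked everything.

I tried the following scenarios:

Intel GPU and CPU running at the same time => everything works
Two CPUs at the same time (server) => everything works
Mixing devices from two platforms on the laptop => it crashes with CL_INVALID_COMMAND_QUEUE. It crashes only on the NVIDIA GPU.

Most of the initialization code follows.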

std::cout << "Initializing the OpenCL engine..\n";
cl_int ret;
unsigned int nrOfActiveContexts = 0;
ret = clGetPlatformIDs(0, NULL, &mRetNumPlatforms);
if (mRetNumPlatforms > 0)
{

    this->mPlatforms.resize(mRetNumPlatforms);
}
else
{
    fprintf(stderr, "No OpenCL platform available.\n");
    exit(1);
}

ret = clGetPlatformIDs(mRetNumPlatforms, mPlatforms.data(), NULL);

std::vector<cl_device_id> devices;
cl_context context;
cl_uint numberOfDevices;
//query for available compute platforms
for (int i = 0; i < mPlatforms.size() ; i++)
{
    bool error = false;
    numberOfDevices = 0;
    devices.clear();
    context = NULL;
    cl_device_type deviceTypes = CL_DEVICE_TYPE_ALL;
    if (useCPU &&useGPU)
        deviceTypes = CL_DEVICE_TYPE_ALL;
    else if (useCPU)
        deviceTypes = CL_DEVICE_TYPE_CPU;
    else if (useGPU)
        deviceTypes = CL_DEVICE_TYPE_GPU;

    ret = clGetDeviceIDs(mPlatforms[i], deviceTypes, 0, NULL, &numberOfDevices);
    if (numberOfDevices > 0)
    {
        devices.resize( numberOfDevices);
        ret = clGetDeviceIDs(mPlatforms[i], deviceTypes,
            numberOfDevices, devices.data(), NULL);
    }
    else continue;

    context = clCreateContext(NULL, numberOfDevices, devices.data(), NULL, NULL, &ret);
    if (ret != CL_SUCCESS)
        throw(std::abort);
    mContexts.push_back(context);
    if (ret != CL_SUCCESS)
    {
        error = true;
    }
    //query device properties create Workers
    size_t ret_size;
    cl_uint compute_units;
    cl_ulong max_alloc;
    size_t max_work_size;
    std::string name;
    std::vector<char> c_name;
    for (int y = 0; y < devices.size(); y++)
    {
        ret_size = compute_units = max_alloc = max_work_size = 0;
        c_name.clear();

        ret = clGetDeviceInfo(devices[y], CL_DEVICE_NAME, 0, NULL, &ret_size);
        if (ret != CL_SUCCESS)
        {
            error = true; goto errored;
        }
        c_name.resize(ret_size);
        ret = clGetDeviceInfo(devices[y], CL_DEVICE_NAME, c_name.size(), c_name.data(), &ret_size);
        if (ret != CL_SUCCESS)
        {
            error = true; goto errored;
        }
        name = std::string(c_name.begin(), c_name.end());
        name = std::regex_replace(name, std::regex("[' ']{2,}"), " ");

        cl_device_type   devType;
        ret = clGetDeviceInfo(devices[y], CL_DEVICE_TYPE, sizeof(cl_device_type), (void *)&devType, NULL);
        if (ret != CL_SUCCESS)
        {
            error = true; goto errored;
        }



        ret = clGetDeviceInfo(devices[y], CL_DEVICE_MAX_COMPUTE_UNITS, sizeof(cl_uint), (void *)&compute_units, NULL);
        if (ret != CL_SUCCESS)
        {
            error = true; goto errored;
        }
        ret = clGetDeviceInfo(devices[y], CL_DEVICE_MAX_WORK_GROUP_SIZE, sizeof(size_t), (void *)&max_work_size, NULL);
        if (ret != CL_SUCCESS)
        {
            error = true; goto errored;
        }
        CWorker::eWorkerType type;
        if (devType & CL_DEVICE_TYPE_GPU)
            type = CWorker::eWorkerType::GPU;
        else
            if (devType & CL_DEVICE_TYPE_CPU)
                type = CWorker::eWorkerType::CPU;


        if (type == CWorker::eWorkerType::CPU)
        {
            if (compute_units > 8)
                max_work_size = compute_units / 4;
            else if (compute_units == 8)
                max_work_size = 2;
            else
                max_work_size = 1;

        }
        ret = clGetDeviceInfo(devices[y], CL_DEVICE_MAX_MEM_ALLOC_SIZE, sizeof(cl_ulong), (void *)&max_alloc, NULL);
        if (ret != CL_SUCCESS)
        {
            error = true;
            goto errored;
        }
    errored:
        if (error != true)
        {
            CWorker  * w = new CWorker();

            w->setDevice(devices[y]);
            w->setMaxComputeUnits(compute_units);
            w->setMaxMemAlloc(max_alloc);
            w->setMaxWorkGroupSize(max_work_size);
            w->setName(name);
            std::cmatch cm;
            if (std::regex_search(name.data(), cm, std::regex("\\w+")))
                w->setShortName(std::string(cm[0]) +"-"+ std::to_string(mWorkers.size()+1));
            w->setContext(context);
            w->setType(type);

            mWorkers.push_back(w);
        }
    }

    nrOfActiveContexts++;
}

if (mWorkers.size() > 0)
    mInitialised = true;
if (mWorkers.size() > 0)
    return true;
else return false;

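【Comments】:

The kernels, although quite complex, are actually plain C.
By the way, I really want to get this solved, so.. each worker executes several kernels, and I put a clFinish after every call to find out where it fails. The NVIDIA GPU fails on the very first one :)
huseyin tugrul buyukisik; I may have overstated the complexity of the kernels. They are complex in that the cryptographic functions they perform are complex, not so much in terms of program complexity. The kernels share one large memory buffer. Each such function constitutes a kernel, and they are synchronized with barriers (CLK_GLOBAL_MEM_FENCE)
Everything works fine on a server with two platforms (one CPU each)
NVIDIA alone stands out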
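
When putting a clFinish after every call to locate the failing step, as described in the comments above, it can help to log the returned status by name along with the worker it came from. The following is only a minimal sketch: `clStatusName` and `describeFailure` are hypothetical helpers that are not part of OpenCL or of the asker's wrapper, and the numeric constants are copied from the Khronos `cl.h` header so that the snippet compiles without the OpenCL SDK. Real code would include `<CL/cl.h>` and use the `CL_*` macros directly.

```cpp
#include <cstdint>
#include <string>

// Status values as defined in the Khronos cl.h header (assumption: copied
// here only so the sketch builds without the SDK installed).
namespace clcodes {
constexpr std::int32_t Success             = 0;    // CL_SUCCESS
constexpr std::int32_t OutOfResources      = -5;   // CL_OUT_OF_RESOURCES
constexpr std::int32_t OutOfHostMemory     = -6;   // CL_OUT_OF_HOST_MEMORY
constexpr std::int32_t InvalidValue        = -30;  // CL_INVALID_VALUE
constexpr std::int32_t InvalidPlatform     = -32;  // CL_INVALID_PLATFORM
constexpr std::int32_t InvalidDevice       = -33;  // CL_INVALID_DEVICE
constexpr std::int32_t InvalidContext      = -34;  // CL_INVALID_CONTEXT
constexpr std::int32_t InvalidCommandQueue = -36;  // CL_INVALID_COMMAND_QUEUE
}  // namespace clcodes

// Hypothetical helper: turns a status code into the macro name from cl.h.
// Codes not listed here are reported numerically instead of being guessed.
inline std::string clStatusName(std::int32_t status) {
    switch (status) {
    case clcodes::Success:             return "CL_SUCCESS";
    case clcodes::OutOfResources:      return "CL_OUT_OF_RESOURCES";
    case clcodes::OutOfHostMemory:     return "CL_OUT_OF_HOST_MEMORY";
    case clcodes::InvalidValue:        return "CL_INVALID_VALUE";
    case clcodes::InvalidPlatform:     return "CL_INVALID_PLATFORM";
    case clcodes::InvalidDevice:       return "CL_INVALID_DEVICE";
    case clcodes::InvalidContext:      return "CL_INVALID_CONTEXT";
    case clcodes::InvalidCommandQueue: return "CL_INVALID_COMMAND_QUEUE";
    default:
        return "unknown OpenCL status " + std::to_string(status);
    }
}

// Hypothetical helper: builds one log line for a failed call, naming the
// worker (for example its short name) and the API call that returned.
inline std::string describeFailure(const std::string& worker,
                                   const std::string& call,
                                   std::int32_t status) {
    return worker + ": " + call + " returned " + clStatusName(status);
}
```

A call site could look like `if (ret != CL_SUCCESS) std::cerr << describeFailure(shortName, "clFinish", ret) << "\n";`, where `shortName` stands for whatever identifier the worker carries. Logging the worker together with the call makes it easier to see whether a queue is being used with a device or context from another platform.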

Tags: c++ c++11 opencl

【Solution 1】:

The problem is most likely in how the context is created:

context = clCreateContext(NULL, numberOfDevices, devices.data(), NULL, NULL, &ret);

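The first argument passed is NULL, which according to the OpenCL manual means that the platform selected is implementation-defined:

Specifies a list of context property names and their corresponding values. Each property name is immediately followed by the corresponding desired value. The list is terminated with 0. properties can be NULL, in which case the platform that is selected is implementation-defined.

Try passing something like this instead: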

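// mPlatforms[i] is the raw cl_platform_id from the question's loop; it is cast to fill the property slot.
cl_context_properties properties[] = { CL_CONTEXT_PLATFORM, (cl_context_properties)mPlatforms[i], 0 };
context = clCreateContext(properties, numberOfDevices, devices.data(), NULL, NULL, &ret);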

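If that does not help, try initializing NVIDIA first (if that is not already the case). It may be that Intel gets initialized first and that its OpenCL driver version is newer than NVIDIA's (e.g. Intel OpenCL 2.0 vs NVIDIA 1.2), and that some of it ends up being used for NVIDIA, hence the error.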
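
To test the version hypothesis above, the OpenCL version each platform reports can be read before deciding on an initialization order. The sketch below only covers the parsing step and rests on some assumptions: `ClVersion`, `parseClVersion`, and `olderThan` are hypothetical names, and the input is expected to be the string obtained from `clGetPlatformInfo` with `CL_PLATFORM_VERSION`, which the specification lays out as `OpenCL <major>.<minor> <platform-specific information>`.

```cpp
#include <cstdio>
#include <string>

// Major/minor OpenCL version reported by a platform.
struct ClVersion {
    int majorVer = 0;
    int minorVer = 0;
};

// Hypothetical helper: extracts the version numbers from a
// CL_PLATFORM_VERSION string. Returns false and leaves `out` untouched
// when the text does not start with "OpenCL <major>.<minor>".
inline bool parseClVersion(const std::string& text, ClVersion& out) {
    int major = 0;
    int minor = 0;
    if (std::sscanf(text.c_str(), "OpenCL %d.%d", &major, &minor) != 2)
        return false;
    out.majorVer = major;
    out.minorVer = minor;
    return true;
}

// Hypothetical helper: strict ordering, true when `a` is an older
// version than `b`. Usable as a comparator for sorting platforms.
inline bool olderThan(const ClVersion& a, const ClVersion& b) {
    if (a.majorVer != b.majorVer)
        return a.majorVer < b.majorVer;
    return a.minorVer < b.minorVer;
}
```

With the parsed versions at hand, the platform list could be sorted with `olderThan` before the context-creation loop so that the platform reporting the older version is set up first. Whether that ordering actually changes the outcome is the answer's conjecture, not something this sketch establishes.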

【Comments】: