Am I reusing OpenCL/Cloo (C#) objects correctly? (Answers)

【Question title】: Am I reusing OpenCL/Cloo(C#) objects correctly?
【Posted】: 2017-07-06 01:43:23
【Question description】:

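I'm experimenting with OpenCL (through Cloo's C# interface). To do so, I'm trying out plain matrix multiplication on the GPU. The problem is that the application crashes during my speed tests. I'm trying to be efficient about re-allocating the various OpenCL objects, and I'm wondering whether I'm botching something in doing so.

I'll put the code in this question, but for the bigger picture you can get the code from GitHub: https://github.com/kwende/ClooMatrixMultiply

My main program does this: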

        Stopwatch gpuSw = new Stopwatch();
        gpuSw.Start();
        for (int c = 0; c < NumberOfIterations; c++)
        {
            float[] result = gpu.MultiplyMatrices(matrix1, matrix2, MatrixHeight, MatrixHeight, MatrixWidth);
        }
        gpuSw.Stop();

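So I'm basically making the call NumberOfIterations times and computing the average execution time.

Inside the MultiplyMatrices call, on the first pass through, I call Initialize to set up all the objects I'm going to reuse: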

    private void Initialize()
    {
        // get the intel integrated GPU
        _integratedIntelGPUPlatform = ComputePlatform.Platforms.Where(n => n.Name.Contains("Intel")).First();

        // create the compute context. 
        _context = new ComputeContext(
            ComputeDeviceTypes.Gpu, // use the gpu
            new ComputeContextPropertyList(_integratedIntelGPUPlatform), // use the intel openCL platform
            null,
            IntPtr.Zero);

        // the command queue is the, well, queue of commands sent to the "device" (GPU)
        _commandQueue = new ComputeCommandQueue(
            _context, // the compute context
            _context.Devices[0], // first device matching the context specifications
            ComputeCommandQueueFlags.None); // no special flags

        string kernelSource = null;
        using (StreamReader sr = new StreamReader("kernel.cl"))
        {
            kernelSource = sr.ReadToEnd();
        }

        // create the "program"
        _program = new ComputeProgram(_context, new string[] { kernelSource });

        // compile. 
        _program.Build(null, null, null, IntPtr.Zero);
        _kernel = _program.CreateKernel("ComputeMatrix");
    }

Then I enter the main body of my function (the part that will be executed NumberOfIterations times).

         ComputeBuffer<float> matrix1Buffer = new ComputeBuffer<float>(_context,
                ComputeMemoryFlags.ReadOnly| ComputeMemoryFlags.CopyHostPointer,
                matrix1);
        _kernel.SetMemoryArgument(0, matrix1Buffer);

        ComputeBuffer<float> matrix2Buffer = new ComputeBuffer<float>(_context,
            ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.CopyHostPointer,
            matrix2);
        _kernel.SetMemoryArgument(1, matrix2Buffer);

        float[] ret = new float[matrix1Height * matrix2Width];
        ComputeBuffer<float> retBuffer = new ComputeBuffer<float>(_context,
            ComputeMemoryFlags.WriteOnly | ComputeMemoryFlags.CopyHostPointer,
            ret);
        _kernel.SetMemoryArgument(2, retBuffer);

        _kernel.SetValueArgument<int>(3, matrix1WidthMatrix2Height);
        _kernel.SetValueArgument<int>(4, matrix2Width);

        _commandQueue.Execute(_kernel,
            new long[] { 0, 0 }, // one offset per work dimension
            new long[] { matrix2Width ,matrix1Height },
            null, null);

        unsafe
        {
            fixed (float* retPtr = ret)
            {
                _commandQueue.Read(retBuffer,
                    false, 0,
                    ret.Length,
                    new IntPtr(retPtr),
                    null);

                _commandQueue.Finish();
            }
        }

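On the third or fourth time through (it's somewhat random, which hints at a memory access problem), the program crashes. Here is my kernel (I'm sure there are faster implementations, but right now my goal is just to get something working without crashing):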

kernel void ComputeMatrix(
    global read_only float* matrix1,
    global read_only float* matrix2,
    global write_only float* output, 
    int matrix1WidthMatrix2Height,
    int matrix2Width)
{
    int x = get_global_id(0); 
    int y = get_global_id(1); 
    int i = y * matrix2Width + x; 

    float value = 0.0f; 
    // row y of matrix1 * column x of matrix2
    for (int c = 0; c < matrix1WidthMatrix2Height; c++)
    {
        int m1Index = y * matrix1WidthMatrix2Height + c;
        int m2Index = c * matrix2Width + x;

        value += matrix1[m1Index] * matrix2[m2Index]; 
    }
    output[i] = value; 
}

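The ultimate goal here is to better understand OpenCL's zero-copy features (since I'm using Intel's integrated GPU). I haven't been able to get that working, so I wanted to take a step back and see whether I understand the more basic things... and apparently I don't, since I can't even get this to work without it blowing up.

The only other thing I can think of is how I'm pinning the pointer to pass it to the .Read() function. But I don't know of an alternative.

EDIT:

For what it's worth, I updated the last part of the code (the read) to this, and it still crashes: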

_commandQueue.ReadFromBuffer(retBuffer, ref ret, false, null);
_commandQueue.Finish();

EDIT #2

Solution found by huseyin tugrul buyukisik (see the comments below).

After placing

matrix1Buffer.Dispose();
matrix2Buffer.Dispose();
retBuffer.Dispose();

at the end of the function, everything works.
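The same fix can also be written with using blocks, so the buffers are released even when an exception is thrown partway through. This is a minimal sketch, not the code from the repository: the class name DisposingMultiplier and its constructor are hypothetical, the parameter names are inferred from the call in the main program, and the context, queue, and kernel are assumed to be created once as in Initialize(). It also passes a null global offset and uses a blocking read in place of the separate Finish() call.

```csharp
using Cloo;

// Hypothetical wrapper; context, queue and kernel are created once elsewhere
// (as in Initialize() above) and passed in.
public sealed class DisposingMultiplier
{
    private readonly ComputeContext _context;
    private readonly ComputeCommandQueue _commandQueue;
    private readonly ComputeKernel _kernel;

    public DisposingMultiplier(ComputeContext context,
        ComputeCommandQueue commandQueue, ComputeKernel kernel)
    {
        _context = context;
        _commandQueue = commandQueue;
        _kernel = kernel;
    }

    public float[] MultiplyMatrices(float[] matrix1, float[] matrix2,
        int matrix1Height, int matrix1WidthMatrix2Height, int matrix2Width)
    {
        float[] ret = new float[matrix1Height * matrix2Width];

        // Each buffer wraps unmanaged OpenCL memory; "using" disposes it
        // deterministically instead of waiting for the garbage collector.
        using (var matrix1Buffer = new ComputeBuffer<float>(_context,
            ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.CopyHostPointer, matrix1))
        using (var matrix2Buffer = new ComputeBuffer<float>(_context,
            ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.CopyHostPointer, matrix2))
        using (var retBuffer = new ComputeBuffer<float>(_context,
            ComputeMemoryFlags.WriteOnly, ret.Length)) // output needs no initial copy
        {
            _kernel.SetMemoryArgument(0, matrix1Buffer);
            _kernel.SetMemoryArgument(1, matrix2Buffer);
            _kernel.SetMemoryArgument(2, retBuffer);
            _kernel.SetValueArgument<int>(3, matrix1WidthMatrix2Height);
            _kernel.SetValueArgument<int>(4, matrix2Width);

            // Two-dimensional launch: x indexes columns, y indexes rows.
            _commandQueue.Execute(_kernel, null,
                new long[] { matrix2Width, matrix1Height }, null, null);

            // Blocking read: returns only after the result is in "ret".
            _commandQueue.ReadFromBuffer(retBuffer, ref ret, true, null);
        }

        return ret;
    }
}
```

Buffers are still created and destroyed on every call here, so this only addresses the leak, not the allocation overhead.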

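【Question comments】:

One possibility is that you re-create buffers without disposing of their resources, which exceeds OpenCL's limits and crashes. Another possibility: ret.Length has to be a size in bytes and needs to be multiplied by sizeof(float), or better, sizeof(cl_float). OpenCL resources need to be disposed. If you reuse them instead of destroying them (when they go out of scope), you don't have to set the arguments again and again. If their scope is only gpu.MultiplyMatrices(, you should move buffer creation into the init part. Is this OpenCL 2.0 or 1.2?
You nailed it, my friend. I put Dispose calls at the end of the function to release the buffers, and that fixed it. It turns out the GC wasn't keeping up with the load on GPU memory.
The GC should not be trusted. There must be some using(){} implementation or explicit deallocation. Maybe C# is more reliable, but Java has had problems.
Awesome. I see that now. I gave you credit in my answer above :) Thanks again, my friend.
Also, making the buffer reads/writes blocking by changing the false argument to true may be faster than adding a Finish command afterwards.

Tags: c# opencl cloo

【Solution 1】: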

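OpenCL resources such as buffers, kernels, and command queues should be released explicitly, and only after the other resources bound to them have been released. Re-creating them without releasing quickly depletes the available slots.

You were re-creating the buffers on every call to the gpu method, and that method was the scope of the OpenCL buffers. When it finishes, the GC cannot track OpenCL's unmanaged memory areas, which causes leaks, and the leaks cause the crash.

Many OpenCL implementations use C++ bindings, which require explicit release commands from C#, Java, and other environments.

Also, when repeated kernel executions use exactly the same buffers in the same argument order, the arguments do not need to be set more than once.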
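Taken together, these points suggest creating the buffers once, setting the kernel arguments once, and only transferring data on each call. The following is a sketch under assumptions, not the asker's code: the class ReusingMultiplier is hypothetical, the matrix dimensions are assumed to stay fixed for the lifetime of the object, and the context, queue, and kernel are assumed to come from the Initialize() method in the question.

```csharp
using System;
using Cloo;

// Hypothetical helper: owns three fixed-size buffers for repeated multiplications.
public sealed class ReusingMultiplier : IDisposable
{
    private readonly ComputeCommandQueue _commandQueue;
    private readonly ComputeKernel _kernel;
    private readonly ComputeBuffer<float> _matrix1Buffer;
    private readonly ComputeBuffer<float> _matrix2Buffer;
    private readonly ComputeBuffer<float> _retBuffer;
    private readonly long[] _globalSize;
    private readonly int _retLength;

    public ReusingMultiplier(ComputeContext context, ComputeCommandQueue commandQueue,
        ComputeKernel kernel, int matrix1Height, int matrix1WidthMatrix2Height, int matrix2Width)
    {
        _commandQueue = commandQueue;
        _kernel = kernel;
        _retLength = matrix1Height * matrix2Width;
        _globalSize = new long[] { matrix2Width, matrix1Height };

        // Allocate device buffers once, sized in elements.
        _matrix1Buffer = new ComputeBuffer<float>(context, ComputeMemoryFlags.ReadOnly,
            (long)matrix1Height * matrix1WidthMatrix2Height);
        _matrix2Buffer = new ComputeBuffer<float>(context, ComputeMemoryFlags.ReadOnly,
            (long)matrix1WidthMatrix2Height * matrix2Width);
        _retBuffer = new ComputeBuffer<float>(context, ComputeMemoryFlags.WriteOnly, _retLength);

        // The kernel keeps its arguments between launches, so set them once.
        _kernel.SetMemoryArgument(0, _matrix1Buffer);
        _kernel.SetMemoryArgument(1, _matrix2Buffer);
        _kernel.SetMemoryArgument(2, _retBuffer);
        _kernel.SetValueArgument<int>(3, matrix1WidthMatrix2Height);
        _kernel.SetValueArgument<int>(4, matrix2Width);
    }

    public float[] Multiply(float[] matrix1, float[] matrix2)
    {
        // Blocking writes: the managed arrays are only touched during these calls.
        _commandQueue.WriteToBuffer(matrix1, _matrix1Buffer, true, null);
        _commandQueue.WriteToBuffer(matrix2, _matrix2Buffer, true, null);

        _commandQueue.Execute(_kernel, null, _globalSize, null, null);

        float[] ret = new float[_retLength];
        _commandQueue.ReadFromBuffer(_retBuffer, ref ret, true, null);
        return ret;
    }

    public void Dispose()
    {
        // Release the unmanaged OpenCL memory explicitly.
        _retBuffer.Dispose();
        _matrix2Buffer.Dispose();
        _matrix1Buffer.Dispose();
    }
}
```

A caller would wrap the object in a using block around the timing loop, so the buffers live for all NumberOfIterations calls and are released exactly once.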

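【Comments】:

How familiar are you with the concept of zero copy? Ultimately what I'm trying to do is avoid copying buffers (I'm using an Intel integrated GPU, so the GPU and the "host" CPU share the same address space). What would need to change above to support that? Would I still need to release resources the way I do now?
As far as I know, map/unmap is zero-copy (with use_host_ptr). You are making copies of the buffers. Also, you should query the device parameters to find out whether it has its own memory or really shares CPU memory.
OK. I'll experiment with that, and if I can't get it working, maybe I'll ask another StackOverflow question. I've found a number of examples, but they're hard to read. I'll most likely create a new post, so that I keep this one on track. You've been very helpful with my last few StackOverflow questions. I really appreciate it.
Mapping is generally harder than copying, but it is faster for streaming scenarios, especially with use_host_ptr and device-aligned values such as multiples of 4096.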
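The map/unmap approach mentioned in these comments might look roughly like the sketch below. Several things here are assumptions to check against your Cloo version: that ComputeDevice exposes a HostUnifiedMemory property, and that ComputeCommandQueue.Map and Unmap have the signatures used (offset and length in elements). The class ZeroCopySketch and its method names are hypothetical. The sketch uses AllocateHostPointer rather than the use_host_ptr flag from the comment, because a managed float[] gives no control over alignment. Whether a mapping is really copy-free is up to the OpenCL implementation.

```csharp
using System;
using System.Runtime.InteropServices;
using Cloo;

// Hypothetical helpers illustrating map/unmap; not taken from the question's repository.
public static class ZeroCopySketch
{
    // Asks the first device whether it reports memory shared with the host.
    // Assumption: ComputeDevice.HostUnifiedMemory is available in your Cloo version.
    public static bool SharesHostMemory(ComputeContext context)
    {
        return context.Devices[0].HostUnifiedMemory;
    }

    // Lets the runtime allocate host-accessible memory for the output buffer,
    // so the implementation can choose a suitable alignment itself.
    public static ComputeBuffer<float> CreateOutputBuffer(ComputeContext context, int count)
    {
        return new ComputeBuffer<float>(context,
            ComputeMemoryFlags.WriteOnly | ComputeMemoryFlags.AllocateHostPointer, count);
    }

    // Maps the buffer for reading, copies the contents into a managed array,
    // and unmaps again. The Marshal.Copy is still one host-side copy; reading
    // through the mapped pointer directly (unsafe code) would avoid it.
    public static void ReadMapped(ComputeCommandQueue queue,
        ComputeBuffer<float> buffer, float[] destination)
    {
        // Blocking map: returns once earlier commands on the queue have finished.
        IntPtr mapped = queue.Map(buffer, true, ComputeMemoryMappingFlags.Read,
            0, destination.Length, null);
        try
        {
            Marshal.Copy(mapped, destination, 0, destination.Length);
        }
        finally
        {
            // Hand the memory back to the device before the next kernel launch.
            queue.Unmap(buffer, ref mapped, null);
        }
    }
}
```

The buffer returned by CreateOutputBuffer still wraps an unmanaged OpenCL object, so it needs the same explicit Dispose as any other buffer.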