OpenCL kernel slower than normal Java loop

【Question title】: OpenCL kernel slower than normal Java loop
【Posted】: 2016-04-06 18:31:06
【Question】:

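I have been looking into using OpenCL to optimize code and run tasks in parallel, in order to get more speed than pure Java. Now I have run into a bit of a problem.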

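I have written a Java program using LWJGL that, as far as I can tell, should perform nearly the same task (adding the elements of two arrays and storing the result in a third array) in two different ways: once in pure Java and once with an OpenCL kernel. I use System.currentTimeMillis() to measure how long each one takes to process arrays with a large number of elements (~10,000,000). For whatever reason, the pure Java loop seems to run about 3 to 10 times faster than the CL program, depending on the array size. My code is below (imports omitted):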

public class TestCL {

    private static final int SIZE = 9999999; //Size of arrays to test, this value is changed sometimes in between tests

    private static CLContext context; //CL Context
    private static CLPlatform platform; //CL platform
    private static List<CLDevice> devices; //List of CL devices
    private static CLCommandQueue queue; //Command Queue for context
    private static float[] aData, bData, rData; //float arrays to store test data

    //---Kernel Code---
    //The actual kernel script is here:
    //-----------------
    private static String kernel = "kernel void sum(global const float* a, global const float* b, global float* result, int const size){\n" + 
            "const int itemId = get_global_id(0);\n" + 
            "if(itemId < size){\n" + 
            "result[itemId] = a[itemId] + b[itemId];\n" +
            "}\n" +
            "}";

    public static void main(String[] args){

        aData = new float[SIZE];
        bData = new float[SIZE];
        rData = new float[SIZE]; //Only used for CPU testing

        //arbitrary testing data
        for(int i=0; i<SIZE; i++){
            aData[i] = i;
            bData[i] = SIZE - i;
        }

        try {
            testCPU(); //How long does it take running in traditional Java code on the CPU?
            testGPU(); //How long does the GPU take to run it w/ CL?
        } catch (Exception e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }

    /**
     * Test the CPU with pure Java code
     */
    private static void testCPU(){
        long time = System.currentTimeMillis();
        for(int i=0; i<SIZE; i++){
            rData[i] = aData[i] + bData[i];
        }
        //Print the time FROM THE START OF THE testCPU() FUNCTION UNTIL NOW
        System.out.println("CPU processing time for " + SIZE + " elements: " + (System.currentTimeMillis() - time));
    }

    /**
     * Test the GPU with OpenCL
     * @throws LWJGLException
     */
    private static void testGPU() throws LWJGLException {
        CLInit(); //Initialize CL and CL Objects

        //Create the CL Program
        CLProgram program = CL10.clCreateProgramWithSource(context, kernel, null);

        int error = CL10.clBuildProgram(program, devices.get(0), "", null);
        Util.checkCLError(error);

        //Create the Kernel
        CLKernel sum = CL10.clCreateKernel(program, "sum", null);

        //Error checker
        IntBuffer eBuf = BufferUtils.createIntBuffer(1);

        //Floatbuffer for the first array of floats
        FloatBuffer aBuf = BufferUtils.createFloatBuffer(SIZE);
        aBuf.put(aData);
        aBuf.rewind();
        CLMem aMem = CL10.clCreateBuffer(context, CL10.CL_MEM_READ_ONLY | CL10.CL_MEM_COPY_HOST_PTR, aBuf, eBuf); //kernel only reads this buffer
        Util.checkCLError(eBuf.get(0));

        //And the second
        FloatBuffer bBuf = BufferUtils.createFloatBuffer(SIZE);
        bBuf.put(bData);
        bBuf.rewind();
        CLMem bMem = CL10.clCreateBuffer(context, CL10.CL_MEM_READ_ONLY | CL10.CL_MEM_COPY_HOST_PTR, bBuf, eBuf); //kernel only reads this buffer
        Util.checkCLError(eBuf.get(0));

        //Memory object to store the result
        CLMem rMem = CL10.clCreateBuffer(context, CL10.CL_MEM_WRITE_ONLY, SIZE * 4, eBuf); //kernel only writes this buffer
        Util.checkCLError(eBuf.get(0));

        //Get time before setting kernel arguments
        long time = System.currentTimeMillis();

        sum.setArg(0, aMem);
        sum.setArg(1, bMem);
        sum.setArg(2, rMem);
        sum.setArg(3, SIZE);

        final int dim = 1;
        PointerBuffer workSize = BufferUtils.createPointerBuffer(dim);
        workSize.put(0, SIZE);

        //Actually running the program
        CL10.clEnqueueNDRangeKernel(queue, sum, dim, null, workSize, null, null, null);
        CL10.clFinish(queue);

        //Write results to a FloatBuffer
        FloatBuffer res = BufferUtils.createFloatBuffer(SIZE);
        CL10.clEnqueueReadBuffer(queue, rMem, CL10.CL_TRUE, 0, res, null, null);

        //How long did it take?
        //Print the time FROM THE SETTING OF KERNEL ARGUMENTS UNTIL NOW
        System.out.println("GPU processing time for " + SIZE + " elements: " + (System.currentTimeMillis() - time));

        //Cleanup objects
        CL10.clReleaseKernel(sum);
        CL10.clReleaseProgram(program);
        CL10.clReleaseMemObject(aMem);
        CL10.clReleaseMemObject(bMem);
        CL10.clReleaseMemObject(rMem);

        CLCleanup();
    }

    /**
     * Initialize CL objects
     * @throws LWJGLException
     */
    private static void CLInit() throws LWJGLException {
        IntBuffer eBuf = BufferUtils.createIntBuffer(1);

        CL.create();

        platform = CLPlatform.getPlatforms().get(0);
        devices = platform.getDevices(CL10.CL_DEVICE_TYPE_GPU);
        context = CLContext.create(platform, devices, eBuf);
        queue = CL10.clCreateCommandQueue(context, devices.get(0), CL10.CL_QUEUE_PROFILING_ENABLE, eBuf);

        Util.checkCLError(eBuf.get(0));
    }

    /**
     * Cleanup after CL completion
     */
    private static void CLCleanup(){
        CL10.clReleaseCommandQueue(queue);
        CL10.clReleaseContext(context);
        CL.destroy();
    }

}

Here are some sample console results from various tests:

CPU processing time for 10000000 elements: 24
GPU processing time for 10000000 elements: 88

CPU processing time for 1000000 elements: 7
GPU processing time for 1000000 elements: 10

CPU processing time for 100000000 elements: 193
GPU processing time for 100000000 elements: 943

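Is there something wrong with my code that makes the CL version slower, or is this actually expected in a case like this? If it is the latter, when is CL preferable?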
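A note on the measurement itself: the figures above come from a single cold run timed with System.currentTimeMillis(), so JIT compilation and timer granularity may be mixed into the CPU number. The sketch below shows one common way to get a steadier CPU baseline, using a few untimed warm-up passes and System.nanoTime(). The class name, the warm-up count and the run count are illustrative assumptions, not part of the original program.

```java
// Minimal sketch: time the pure-Java sum after warming up the JIT.
// CpuSumBenchmark, bestNanos and the counts used in main are hypothetical.
final class CpuSumBenchmark {

    // Same element-wise addition as testCPU() in the question.
    static void sum(float[] a, float[] b, float[] r) {
        for (int i = 0; i < r.length; i++) {
            r[i] = a[i] + b[i];
        }
    }

    // Runs `warmups` untimed passes, then returns the fastest of `runs` timed passes in nanoseconds.
    static long bestNanos(float[] a, float[] b, float[] r, int warmups, int runs) {
        for (int i = 0; i < warmups; i++) {
            sum(a, b, r);
        }
        long best = Long.MAX_VALUE;
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            sum(a, b, r);
            best = Math.min(best, System.nanoTime() - start);
        }
        return best;
    }

    public static void main(String[] args) {
        int size = 1_000_000;
        float[] a = new float[size];
        float[] b = new float[size];
        float[] r = new float[size];
        for (int i = 0; i < size; i++) {
            a[i] = i;
            b[i] = size - i;
        }
        long ns = bestNanos(a, b, r, 5, 10); // 5 warm-ups and 10 runs are arbitrary choices
        System.out.println("CPU best of 10 runs: " + ns / 1_000_000.0 + " ms");
    }
}
```

The same pattern (warm up, repeat, keep the best or median) could be applied to the GPU path as well, so that both sides are measured the same way.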

【Comments】:

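Simply adding two arrays together is not enough work to offset the overhead of copying memory to the graphics card and back to main memory. Try making the operation more computationally intensive.
@Tony Ruth Thanks, that seems to be the reason.
For efficiency, you should always remove blocking calls from the GPU pipeline (to avoid CPU intervention). Remove the CL10.clFinish(queue); line, because the read that follows it is already a blocking call.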
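The first comment's point about copy overhead can be put into rough numbers. The sketch below estimates how long it takes just to move float buffers between host and device at a given bandwidth. The 6 GB/s figure is an assumed effective transfer rate chosen for illustration, not a measurement of the asker's hardware, and the class and method names are hypothetical. In the question's code the timed region covers at least the kernel and the read-back of one result buffer; whether the input uploads also fall inside it depends on when the driver actually performs the copy.

```java
import java.util.Locale;

// Back-of-envelope estimate of host<->device copy time.
// TransferEstimate and transferMillis are hypothetical names; the bandwidth is an assumption.
final class TransferEstimate {

    // Milliseconds needed to move `buffers` float arrays of `elements` each at `gbPerSec` GB/s.
    static double transferMillis(long elements, int buffers, double gbPerSec) {
        double bytes = (double) elements * Float.BYTES * buffers;
        return bytes / (gbPerSec * 1e9) * 1e3;
    }

    public static void main(String[] args) {
        long n = 10_000_000L;         // array size used in the question
        double assumedGbPerSec = 6.0; // ASSUMPTION: effective transfer bandwidth
        System.out.println(String.format(Locale.ROOT,
                "read-back of one buffer: %.1f ms", transferMillis(n, 1, assumedGbPerSec)));
        System.out.println(String.format(Locale.ROOT,
                "all three buffers:       %.1f ms", transferMillis(n, 3, assumedGbPerSec)));
    }
}
```

Under that assumed bandwidth, reading back 10,000,000 floats (40 MB) costs several milliseconds on its own, which is the same order of magnitude as the whole CPU loop reported in the question. An addition per element gives the GPU very little work to amortize that cost.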

Tags: java performance opencl gpu lwjgl

【Solution 1】:

I modified the test to do something that I believed would be more computationally expensive than a simple addition.

In the CPU test, the line:

rData[i] = aData[i] + bData[i];

was changed to:

rData[i] = (float)(Math.sin(aData[i]) * Math.cos(bData[i]));

In the CL kernel, the line:

result[itemId] = a[itemId] + b[itemId];

was changed to:

result[itemId] = sin(a[itemId]) * cos(b[itemId]);

I now get console results such as:

CPU processing time for 1000000 elements: 154
GPU processing time for 1000000 elements: 11

CPU processing time for 10000000 elements: 8699
GPU processing time for 10000000 elements: 98

(For the 100,000,000-element test, the CPU took longer than I cared to wait.)

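To check accuracy, I added checks comparing arbitrary elements of rData and res to make sure they are the same. I have omitted the results here; suffice it to say that they were equal.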
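The answer does not show its comparison code. The sketch below is one way such a check might look, assuming the GPU results have first been copied out of the FloatBuffer into a float[] (for example with FloatBuffer.get(float[])). It compares every element against a tolerance instead of testing exact equality, since device sin/cos and Java's Math.sin/Math.cos may differ in the last bits. The class name, method names and the tolerance are hypothetical.

```java
// Minimal sketch of a CPU-vs-GPU result check with a tolerance.
// ResultCheck, maxAbsDiff, approxEqual and the tolerance value are hypothetical.
final class ResultCheck {

    // Largest absolute difference between corresponding elements (NaN if any difference is NaN).
    static float maxAbsDiff(float[] expected, float[] actual) {
        if (expected.length != actual.length) {
            throw new IllegalArgumentException("length mismatch");
        }
        float max = 0f;
        for (int i = 0; i < expected.length; i++) {
            max = Math.max(max, Math.abs(expected[i] - actual[i]));
        }
        return max;
    }

    // True when every element differs by at most `tolerance`; a NaN difference counts as a mismatch.
    static boolean approxEqual(float[] expected, float[] actual, float tolerance) {
        return maxAbsDiff(expected, actual) <= tolerance;
    }

    public static void main(String[] args) {
        float[] cpu = {0.5f, -0.25f, 1.0f};
        float[] gpu = {0.5f, -0.25f, 1.0000001f}; // stand-in for values read back from the device
        float tolerance = 1e-5f;                  // ASSUMPTION: acceptable error for this test
        System.out.println("results match: " + approxEqual(cpu, gpu, tolerance));
    }
}
```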

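Now that the function is more complex (two trigonometric functions multiplied together), the CL kernel appears to be more efficient than the pure Java loop.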

【Comments】: