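【Posted】: 2016-04-06 18:31:06
【Question】:
I have been looking into using OpenCL to optimize code and run tasks in parallel, aiming for better speed than pure Java. Now I've run into a problem.
I have written a Java program using LWJGL that, as far as I can tell, should perform nearly the same task (in this case, adding the elements of two arrays and storing the result in a third array) in two different ways: one in pure Java, the other with an OpenCL kernel. I am using System.currentTimeMillis() to measure how long each takes for arrays with a large number of elements (~10,000,000). For whatever reason, the pure Java loop seems to run about 3 to 10 times faster than the CL program, depending on the array size. My code is below (imports omitted):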
public class TestCL {

    private static final int SIZE = 9999999; //Size of arrays to test, this value is changed sometimes in between tests

    private static CLContext context; //CL Context
    private static CLPlatform platform; //CL platform
    private static List<CLDevice> devices; //List of CL devices
    private static CLCommandQueue queue; //Command Queue for context
    private static float[] aData, bData, rData; //float arrays to store test data

    //---Kernel Code---
    //The actual kernel script is here:
    //-----------------
    private static String kernel = "kernel void sum(global const float* a, global const float* b, global float* result, int const size){\n" +
            "const int itemId = get_global_id(0);\n" +
            "if(itemId < size){\n" +
            "result[itemId] = a[itemId] + b[itemId];\n" +
            "}\n" +
            "}";

    public static void main(String[] args){
        aData = new float[SIZE];
        bData = new float[SIZE];
        rData = new float[SIZE]; //Only used for CPU testing

        //arbitrary testing data
        for(int i=0; i<SIZE; i++){
            aData[i] = i;
            bData[i] = SIZE - i;
        }

        try {
            testCPU(); //How long does it take running in traditional Java code on the CPU?
            testGPU(); //How long does the GPU take to run it w/ CL?
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    /**
     * Test the CPU with pure Java code
     */
    private static void testCPU(){
        long time = System.currentTimeMillis();

        for(int i=0; i<SIZE; i++){
            rData[i] = aData[i] + bData[i];
        }

        //Print the time FROM THE START OF THE testCPU() FUNCTION UNTIL NOW
        System.out.println("CPU processing time for " + SIZE + " elements: " + (System.currentTimeMillis() - time));
    }

    /**
     * Test the GPU with OpenCL
     * @throws LWJGLException
     */
    private static void testGPU() throws LWJGLException {
        CLInit(); //Initialize CL and CL Objects

        //Create the CL Program
        CLProgram program = CL10.clCreateProgramWithSource(context, kernel, null);
        int error = CL10.clBuildProgram(program, devices.get(0), "", null);
        Util.checkCLError(error);

        //Create the Kernel
        CLKernel sum = CL10.clCreateKernel(program, "sum", null);

        //Error checker
        IntBuffer eBuf = BufferUtils.createIntBuffer(1);

        //FloatBuffer for the first array of floats
        //The kernel only reads a and b, so both are CL_MEM_READ_ONLY (the flags describe kernel-side access)
        FloatBuffer aBuf = BufferUtils.createFloatBuffer(SIZE);
        aBuf.put(aData);
        aBuf.rewind();
        CLMem aMem = CL10.clCreateBuffer(context, CL10.CL_MEM_READ_ONLY | CL10.CL_MEM_COPY_HOST_PTR, aBuf, eBuf);
        Util.checkCLError(eBuf.get(0));

        //And the second
        FloatBuffer bBuf = BufferUtils.createFloatBuffer(SIZE);
        bBuf.put(bData);
        bBuf.rewind();
        CLMem bMem = CL10.clCreateBuffer(context, CL10.CL_MEM_READ_ONLY | CL10.CL_MEM_COPY_HOST_PTR, bBuf, eBuf);
        Util.checkCLError(eBuf.get(0));

        //Memory object to store the result; the kernel only writes it, so it is CL_MEM_WRITE_ONLY
        CLMem rMem = CL10.clCreateBuffer(context, CL10.CL_MEM_WRITE_ONLY, SIZE * 4, eBuf);
        Util.checkCLError(eBuf.get(0));

        //Get time before setting kernel arguments
        long time = System.currentTimeMillis();

        sum.setArg(0, aMem);
        sum.setArg(1, bMem);
        sum.setArg(2, rMem);
        sum.setArg(3, SIZE);

        final int dim = 1;
        PointerBuffer workSize = BufferUtils.createPointerBuffer(dim);
        workSize.put(0, SIZE);

        //Actually running the program
        CL10.clEnqueueNDRangeKernel(queue, sum, dim, null, workSize, null, null, null);
        CL10.clFinish(queue);

        //Write results to a FloatBuffer
        FloatBuffer res = BufferUtils.createFloatBuffer(SIZE);
        CL10.clEnqueueReadBuffer(queue, rMem, CL10.CL_TRUE, 0, res, null, null);

        //How long did it take?
        //Print the time FROM THE SETTING OF KERNEL ARGUMENTS UNTIL NOW
        System.out.println("GPU processing time for " + SIZE + " elements: " + (System.currentTimeMillis() - time));

        //Cleanup objects
        CL10.clReleaseKernel(sum);
        CL10.clReleaseProgram(program);
        CL10.clReleaseMemObject(aMem);
        CL10.clReleaseMemObject(bMem);
        CL10.clReleaseMemObject(rMem);
        CLCleanup();
    }

    /**
     * Initialize CL objects
     * @throws LWJGLException
     */
    private static void CLInit() throws LWJGLException {
        IntBuffer eBuf = BufferUtils.createIntBuffer(1);
        CL.create();
        platform = CLPlatform.getPlatforms().get(0);
        devices = platform.getDevices(CL10.CL_DEVICE_TYPE_GPU);
        context = CLContext.create(platform, devices, eBuf);
        queue = CL10.clCreateCommandQueue(context, devices.get(0), CL10.CL_QUEUE_PROFILING_ENABLE, eBuf);
        Util.checkCLError(eBuf.get(0));
    }

    /**
     * Cleanup after CL completion
     */
    private static void CLCleanup(){
        CL10.clReleaseCommandQueue(queue);
        CL10.clReleaseContext(context);
        CL.destroy();
    }
}
Here are some sample console results from various tests:
CPU processing time for 10000000 elements: 24
GPU processing time for 10000000 elements: 88
CPU processing time for 1000000 elements: 7
GPU processing time for 1000000 elements: 10
CPU processing time for 100000000 elements: 193
GPU processing time for 100000000 elements: 943
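Is there something wrong with my code that makes the CPU version faster, or is this actually expected in a case like this? If it is the latter, when is CL preferable?
【Comments】: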
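A note on the measurement itself: a single pass timed with System.currentTimeMillis() has millisecond granularity and includes JIT warm-up, so the CPU figures above are only approximate. The following is a minimal sketch of a steadier CPU measurement, assuming a few untimed warm-up passes are enough for the JIT to compile the loop. The class name CpuAddTiming, the helpers addArrays and timeAddNanos, and the pass counts are illustrative and do not come from the posted program.

```java
class CpuAddTiming {
    // Element-wise sum, the same operation as the loop in testCPU().
    static void addArrays(float[] a, float[] b, float[] r) {
        for (int i = 0; i < r.length; i++) {
            r[i] = a[i] + b[i];
        }
    }

    // Runs untimed warm-up passes so the JIT can compile the loop,
    // then returns the fastest of the timed passes in nanoseconds.
    static long timeAddNanos(float[] a, float[] b, float[] r, int warmups, int runs) {
        for (int i = 0; i < warmups; i++) {
            addArrays(a, b, r);
        }
        long best = Long.MAX_VALUE;
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            addArrays(a, b, r);
            best = Math.min(best, System.nanoTime() - start);
        }
        return best;
    }

    public static void main(String[] args) {
        final int size = 1_000_000; // assumption: smaller than the posted SIZE to keep the sketch quick
        float[] a = new float[size];
        float[] b = new float[size];
        float[] r = new float[size];
        for (int i = 0; i < size; i++) {
            a[i] = i;
            b[i] = size - i;
        }
        long nanos = timeAddNanos(a, b, r, 5, 10);
        System.out.println("CPU best of 10 for " + size + " elements: " + (nanos / 1_000_000.0) + " ms");
    }
}
```

Taking the best of several passes is one common choice; a median would be less sensitive to a single lucky run.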
-
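Simply adding two arrays together is not enough work to offset the overhead of copying the memory to the graphics card and back to main memory. Try making the operation more computationally intensive.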
-
@Tony Ruth Thanks, that seems to be the cause.
-
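To be efficient, you should always remove blocking calls from the GPU pipeline (to avoid CPU intervention). Remove the CL10.clFinish(queue); line, since the read that follows is already a blocking call.
Tags: java performance opencl gpu lwjgl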
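The copy overhead mentioned in the first comment can be put into rough numbers. The sketch below only counts the bytes that cross the bus for result = a + b (two uploads and one download, against a single addition per element) and converts that into a time under an assumed bandwidth. The 6 GB/s figure is an illustrative assumption, not a measurement of any particular card, and the names TransferEstimate, bytesMoved and transferMillis are hypothetical. The posted timer starts after the clCreateBuffer calls, so how much of this traffic falls inside the measured interval depends on when the driver actually performs the uploads.

```java
class TransferEstimate {
    // Bytes crossing the bus for result = a + b:
    // two input arrays uploaded and one result array downloaded.
    static long bytesMoved(long elements) {
        return 3L * elements * Float.BYTES;
    }

    // Estimated transfer time in milliseconds for a bandwidth given in bytes per second.
    static double transferMillis(long elements, double bytesPerSecond) {
        return bytesMoved(elements) / bytesPerSecond * 1000.0;
    }

    public static void main(String[] args) {
        // ASSUMPTION: 6 GB/s is an illustrative host<->device bandwidth, not a measured value.
        double assumedBandwidth = 6e9;
        long[] sizes = {1_000_000L, 10_000_000L, 100_000_000L};
        for (long n : sizes) {
            System.out.printf("%d elements: %d MB moved, ~%.1f ms at the assumed bandwidth%n",
                    n, bytesMoved(n) / 1_000_000, transferMillis(n, assumedBandwidth));
        }
    }
}
```

With one floating-point addition per 12 bytes transferred, the work per byte is very low, which is consistent with the advice to make the kernel do more computation per element before expecting a GPU speedup.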