【Posted】: 2013-02-04 10:27:33
【Question】:
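The following kernel computes an acoustic pressure field. Each work-item computes its own private instance of the pressure vector, and these instances then need to be summed into global memory.
I am fairly sure the code that computes the pressure vector is correct, but I still cannot get the kernel to produce the expected result.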
int gid = get_global_id(0);
int lid = get_local_id(0);
int nGroups = get_num_groups(0);
int groupSize = get_local_size(0);
int groupID = get_group_id(0);

/* Each workitem gets private storage for the pressure field.
 * The private instances are then summed into local storage at the end. */
private float2 pressure[HYD_DIM_TOTAL];
local float2 pressure_local[HYD_DIM_TOTAL];

/* Code which computes value of 'pressure' */

//wait for all workgroups to finish accessing any memory
barrier(CLK_GLOBAL_MEM_FENCE | CLK_LOCAL_MEM_FENCE);

/// sum all results in a workgroup into local buffer:
for(i=0; i<groupSize; i++){
    //each thread sums its own private instance into the local buffer
    if (i == lid){
        for(iHyd=0; iHyd<HYD_DIM_TOTAL; iHyd++){
            pressure_local[iHyd] += pressure[iHyd];
        }
    }
    //make sure all threads in workgroup get updated values of the local buffer
    barrier(CLK_LOCAL_MEM_FENCE);
}

/// copy all the results into global storage
//1st thread in each workgroup writes the group's local buffer to global memory
if(lid == 0){
    for(iHyd=0; iHyd<HYD_DIM_TOTAL; iHyd++){
        pressure_global[groupID +nGroups*iHyd] = pressure_local[iHyd];
    }
}
barrier(CLK_GLOBAL_MEM_FENCE);

/// sum the various instances in global memory into a single one
// 1st thread sums global instances
if(gid == 0){
    for(iGroup=1; iGroup<nGroups; iGroup++){
        //we only need to sum the results from the 1st group onward
        for(iHyd=0; iHyd<HYD_DIM_TOTAL; iHyd++){
            pressure_global[iHyd] += pressure_global[iGroup*HYD_DIM_TOTAL +iHyd];
            barrier(CLK_GLOBAL_MEM_FENCE);
        }
    }
}
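Some notes on the data dimensions:
The total number of threads will vary between 100 and 2000, but may occasionally fall outside this range. groupSize depends on the hardware, but I am currently using values between 1 (CPU) and 32 (GPU). HYD_DIM_TOTAL is known at compile time and varies between 4 and 32 (usually, but not necessarily, a power of 2).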
Is there anything obviously wrong with this reduction code?
PS: I am running this on an i7 3930k and an NVIDIA GTX580, using AMD APP SDK 2.8.
【Discussion】:
Tags: opencl gpu gpgpu reduction amd-processor