【发布时间】:2013-05-23 02:35:01
【问题描述】:
我想使用 GPU 内核对卷执行双重阈值。我将每个切片的音量作为只读 image2d_t 发送。我的输出体积是二进制体积,其中 每个位 指定其相关体素是启用还是禁用。我的内核检查当前像素值是否在下/上限阈值范围内,并在二进制卷中启用其对应位。
出于调试目的,我暂时将实际检查留下了注释。我只是使用传递的切片 nr 来确定二进制音量位是否应该打开或关闭。前 14 个切片设置为“开”,其余设置为“关”。我还在 CPU 端验证了这段代码,我粘贴在这篇文章底部的代码。代码显示了两条路径,现在对 CPU 进行了注释。
CPU 代码按预期工作,应用二进制掩码渲染体积后返回以下图像:
使用我的 GPU 内核运行完全相同的逻辑会返回不正确的结果(第一个 3D,第二个切片视图):
这里出了什么问题?我读到 OpenCL 不支持位字段,但据我从 OpenCL 规范中了解,它确实支持位运算符。我的位逻辑从 32 位字中选择正确的位并翻转它,是否受支持?还是我的简单标志被认为是位字段。它的作用是从左侧选择体素%32 位(不是右侧,因此是减法)。
另一件事可能是传递给我的内核的 uint 指针与我的预期不同。我认为这将是对指针的有效使用并将数据传递给我的内核。应用于内核中“uint* word”部分的逻辑是由于每行填充单词和每片填充行。 CPU变种确认指针计算逻辑是有效的。
下面;代码
uint wordsPerRow = (uint)BinaryVolumeWordsPerRow(volume.Geometry.NumberOfVoxels);
uint wordsPerPlane = (uint)BinaryVolumeWordsPerPlane(volume.Geometry.NumberOfVoxels);
int[] dims = new int[3];
dims[0] = volume.Geometry.NumberOfVoxels.X;
dims[1] = volume.Geometry.NumberOfVoxels.Y;
dims[2] = volume.Geometry.NumberOfVoxels.Z;
uint[] arrC = dstVolume.BinaryData.ObtainArray() as uint[];
unsafe {
fixed(int* dimPtr = dims) {
fixed(uint *arrcPtr = arrC) {
// pick Cloo Platform
ComputePlatform platform = ComputePlatform.Platforms[0];
// create context with all gpu devices
ComputeContext context = new ComputeContext(ComputeDeviceTypes.Gpu,
new ComputeContextPropertyList(platform), null, IntPtr.Zero);
// load opencl source
StreamReader streamReader = new StreamReader(@"C:\views\pii-sw113v1\PMX\ADE\Philips\PmsMip\Private\Viewing\Base\BinaryVolumes\kernels\kernel.cl");
string clSource = streamReader.ReadToEnd();
streamReader.Close();
// create program with opencl source
ComputeProgram program = new ComputeProgram(context, clSource);
// compile opencl source
program.Build(null, null, null, IntPtr.Zero);
// Create the event wait list. An event list is not really needed for this example but it is important to see how it works.
// Note that events (like everything else) consume OpenCL resources and creating a lot of them may slow down execution.
// For this reason their use should be avoided if possible.
ComputeEventList eventList = new ComputeEventList();
// Create the command queue. This is used to control kernel execution and manage read/write/copy operations.
ComputeCommandQueue commands = new ComputeCommandQueue(context, context.Devices[0], ComputeCommandQueueFlags.None);
// Create the kernel function and set its arguments.
ComputeKernel kernel = program.CreateKernel("LowerThreshold");
int slicenr = 0;
foreach (IntPtr ptr in pinnedSlices) {
/*// CPU VARIANT FOR TESTING PURPOSES
for (int y = 0; y < dims[1]; y++) {
for (int x = 0; x < dims[0]; x++) {
long pixelOffset = x + y * dims[0];
ushort* ushortPtr = (ushort*)ptr;
ushort pixel = *(ushortPtr + pixelOffset);
int BinaryWordShift = 5;
int BinaryWordBits = 32;
if (
(0 <= x) &&
(0 <= y) &&
(0 <= slicenr) &&
(x < dims[0]) &&
(y < dims[1]) &&
(slicenr < dims[2])
) {
uint* word =
arrcPtr + 1 + (slicenr * wordsPerPlane) +
(y * wordsPerRow) +
(x >> BinaryWordShift);
uint mask = (uint)(0x1 << ((BinaryWordBits - 1) - (byte)(x & 0x1f)));
//if (pixel > lowerThreshold && pixel < upperThreshold) {
if (slicenr < 15) {
*word |= mask;
} else {
*word &= ~mask;
}
}
}
}*/
ComputeBuffer<int> dimsBuffer = new ComputeBuffer<int>(
context,
ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.CopyHostPointer,
3,
new IntPtr(dimPtr));
ComputeImageFormat format = new ComputeImageFormat(ComputeImageChannelOrder.Intensity, ComputeImageChannelType.UnsignedInt16);
ComputeImage2D image2D = new ComputeImage2D(
context,
ComputeMemoryFlags.ReadOnly,
format,
volume.Geometry.NumberOfVoxels.X,
volume.Geometry.NumberOfVoxels.Y,
0,
ptr
);
// The output buffer doesn't need any data from the host. Only its size is specified (arrC.Length).
ComputeBuffer<uint> c = new ComputeBuffer<uint>(
context, ComputeMemoryFlags.WriteOnly, arrC.Length);
kernel.SetMemoryArgument(0, image2D);
kernel.SetMemoryArgument(1, dimsBuffer);
kernel.SetValueArgument(2, wordsPerRow);
kernel.SetValueArgument(3, wordsPerPlane);
kernel.SetValueArgument(4, slicenr);
kernel.SetValueArgument(5, lowerThreshold);
kernel.SetValueArgument(6, upperThreshold);
kernel.SetMemoryArgument(7, c);
// Execute the kernel "count" times. After this call returns, "eventList" will contain an event associated with this command.
// If eventList == null or typeof(eventList) == ReadOnlyCollection<ComputeEventBase>, a new event will not be created.
commands.Execute(kernel, null, new long[] { dims[0], dims[1] }, null, eventList);
// Read back the results. If the command-queue has out-of-order execution enabled (default is off), ReadFromBuffer
// will not execute until any previous events in eventList (in our case only eventList[0]) are marked as complete
// by OpenCL. By default the command-queue will execute the commands in the same order as they are issued from the host.
// eventList will contain two events after this method returns.
commands.ReadFromBuffer(c, ref arrC, false, eventList);
// A blocking "ReadFromBuffer" (if 3rd argument is true) will wait for itself and any previous commands
// in the command queue or eventList to finish execution. Otherwise an explicit wait for all the opencl commands
// to finish has to be issued before "arrC" can be used.
// This explicit synchronization can be achieved in two ways:
// 1) Wait for the events in the list to finish,
//eventList.Wait();
//}
// 2) Or simply use
commands.Finish();
slicenr++;
}
}
}
}
还有我的内核代码:
const sampler_t smp = CLK_FILTER_NEAREST | CLK_ADDRESS_CLAMP | CLK_NORMALIZED_COORDS_FALSE;
kernel void LowerThreshold(
read_only image2d_t image,
global int* brickSize,
uint wordsPerRow,
uint wordsPerPlane,
int slicenr,
int lower,
int upper,
global write_only uint* c )
{
int4 coord = (int4)(get_global_id(0),get_global_id(1),slicenr,1);
uint4 pixel = read_imageui(image, smp, coord.xy);
uchar BinaryWordShift = 5;
int BinaryWordBits = 32;
if (
(0 <= coord.x) &&
(0 <= coord.y) &&
(0 <= coord.z) &&
(coord.x < brickSize[0]) &&
(coord.y < brickSize[1]) &&
(coord.z < brickSize[2])
) {
global uint* word =
c + 1 + (coord.z * wordsPerPlane) +
(coord.y * wordsPerRow) +
(coord.x >> BinaryWordShift);
uint mask = (uint)(0x1 << ((BinaryWordBits - 1) - (uchar)(coord.x & 0x1f)));
//if (pixel.w > lower && pixel.w < upper) {
if (slicenr < 15) {
*word |= mask;
} else {
*word &= ~mask;
}
}
}
【问题讨论】:
-
你给的内核是哪一个? CPU 还是 GPU?
-
同一个词是由多个线程同时写入的?这会产生危险,可能使用原子操作或使用不同的更新位策略
-
内核代码是GPU内核。内核的 CPU 实现在其上面的代码部分中被注释掉。是的,同一个 *word(因此传入的 *uint c)是由多个线程编写的。我预计底层的 opencl 在全局缓存写入级别会有某种 lock n 等待结构,但事实并非如此?
-
使用 atomic_or 代替 |=, atomic_and 代替 &=。虽然这应该保证正确性(如果没有其他错误),但效率低下,因为每个单词都会发生争用。