Answers to: C++ AMP exception in simple image processing example

[Question title]: C++AMP exception in simple image processing example
[Posted]: 2014-03-19 21:14:51
[Question]:

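I am trying to teach myself C++ AMP and would like to start with a very simple task from my field, which is image processing. I want to convert a 24 bits-per-pixel RGB image (a Bitmap) into an 8 bits-per-pixel grayscale image. The image data is available in unsigned char arrays (obtained from Bitmap::LockBits(...) etc.).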
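For reference, the conversion described above can be sketched on the CPU in plain C++. This is only a minimal sketch under stated assumptions: the function name `to_gray_average` is made up for illustration, the pixels are assumed to be tightly packed (no row padding, which buffers from Bitmap::LockBits may in fact have), and the grayscale value is the plain channel average that the kernel in this question uses. Such a function can serve as a baseline to compare GPU output against.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: converts tightly packed 24 bpp pixels to 8 bpp
// grayscale by averaging the three channels of each pixel.
// Assumption: stride == width * 3, i.e. rows carry no padding bytes.
std::vector<std::uint8_t> to_gray_average(const std::vector<std::uint8_t>& rgb,
                                          std::size_t width, std::size_t height)
{
    assert(rgb.size() == width * height * 3);
    std::vector<std::uint8_t> gray(width * height);
    for (std::size_t i = 0; i < gray.size(); ++i) {
        // The three channel bytes of pixel i are adjacent in memory.
        const unsigned sum = rgb[3 * i] + rgb[3 * i + 1] + rgb[3 * i + 2];
        gray[i] = static_cast<std::uint8_t>(sum / 3);
    }
    return gray;
}
```

Because the average is channel-symmetric, the sketch gives the same result whether the source bytes are ordered RGB or BGR.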

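I know that, for some reason, C++ AMP cannot handle char or unsigned char data through array or array_view, so I tried to use textures according to that blog. Here it is explained how to write to 8bpp textures, although Visual Studio 2013 tells me that writeonly_texture_view is deprecated.

My code throws a runtime exception saying "Failed to dispatch kernel." The complete text of the exception is lengthy:

ID3D11DeviceContext::Dispatch: The Unordered Access View (UAV) in slot 0 of the Compute Shader unit has the Format (R8_UINT). This format does not support being read from a shader as a UAV. This mismatch is invalid if the shader actually uses the view (e.g. it is not skipped due to shader code branching). It was unfortunately not possible to have all hardware implementations support reading this format as a UAV, even though the format can be written to as a UAV. If the shader only needs to perform reads but not writes to this resource, consider using a Shader Resource View instead of a UAV.

The code I use so far is this: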

namespace gpu = concurrency;

gpu::extent<3> inputExtent(height, width, 3);
gpu::graphics::texture<unsigned int, 3> inputTexture(inputExtent, eight);
gpu::graphics::copy((void*)inputData24bpp, dataLength, inputTexture);
gpu::graphics::texture_view<unsigned int, 3> inputTexView(inputTexture);
gpu::graphics::texture<unsigned int, 2> outputTexture(width, height, eight);
gpu::graphics::writeonly_texture_view<unsigned int, 2> outputTexView(outputTexture);

gpu::parallel_for_each(outputTexture.extent,
    [inputTexView, outputTexView](gpu::index<2> pix) restrict(amp) {
    gpu::index<3> indR(pix[0], pix[1], 0);
    gpu::index<3> indG(pix[0], pix[1], 1);
    gpu::index<3> indB(pix[0], pix[1], 2);
    unsigned int sum = inputTexView[indR] + inputTexView[indG] + inputTexView[indB];
    outputTexView.set(pix, sum / 3);
});

gpu::graphics::copy(outputTexture, outputData8bpp);

What is the reason for this exception, and what can I do as a workaround?

[Comments on the question]:

unsigned int eight = 8; the same goes for one.

Tags: c++ image-processing visual-studio-2013 gpgpu c++-amp

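[Solution 1]:

I have been learning C++ AMP on my own as well and ran into a problem very similar to yours, except that in my case I needed to process 16-bit images.

The problem can probably be solved with textures, but I cannot help you there because I lack the experience.

So what I did is basically built on bit masking.

First, trick the compiler so that the code compiles: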

unsigned int* sourceData = reinterpret_cast<unsigned int*>(source);
unsigned int* destData   = reinterpret_cast<unsigned int*>(dest);

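Next, your array_view has to cover all of your data. Keep in mind that the elements the view sees are really 32 bits wide, so you have to convert the element count (divide by 2 for 16-bit data; use 4 for 8-bit data).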

// size is the number of 16-bit pixels; two of them fit in each unsigned int.
concurrency::array_view<const unsigned int> source((size + 1) / 2, sourceData);
concurrency::array_view<unsigned int> dest((size + 1) / 2, destData);
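The size conversion is a rounded-up division of the element count by the number of elements that fit in one 32-bit word. A small sketch in plain C++ follows; the helper name `words_needed` is hypothetical and not part of C++ AMP.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: number of 32-bit words needed to hold `count`
// elements of `bits_per_element` bits each, rounding up so that a
// trailing partial word is still covered.
inline std::size_t words_needed(std::size_t count, std::size_t bits_per_element)
{
    // This sketch only considers the two packings discussed in the answer.
    assert(bits_per_element == 8 || bits_per_element == 16);
    const std::size_t per_word = 32 / bits_per_element;  // 4 or 2
    return (count + per_word - 1) / per_word;
}
```

Note that when the element count is not a multiple of the elements per word, the last word extends past the end of the pixel data, so the host buffer should be allocated with that padding in mind.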

Now you can write a typical parallel_for_each block.

typedef concurrency::array_view<const unsigned int> OriginalImage;
typedef concurrency::array_view<unsigned int> ResultImage;

bool Filters::Filter_Invert()
{
    const int size = k_width*k_height;
    const int maxVal = GetMaxSize();

    OriginalImage& im_original = GetOriginal();
    ResultImage& im_result = GetResult();
    im_result.discard_data();

    parallel_for_each(
        concurrency::extent<2>(k_width, k_height), 
        [=](concurrency::index<2> idx) restrict(amp)
    {
        const int pos = GetPos(idx);
        const int val = read_int16(im_original, pos);

        write_int16(im_result, pos, maxVal - val);
    });

    return true;
}

int Filters::GetPos( const concurrency::index<2>& idx )  restrict(amp, cpu)
{
    return idx[0] * Filters::k_height + idx[1];
}

And here comes the magic:

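template <typename T>
unsigned int read_int16(T& arr, int idx) restrict(amp, cpu)
{
    // Two 16-bit values per 32-bit word: idx >> 1 selects the word,
    // (idx & 0x1) << 4 is the bit offset (0 or 16) inside that word.
    return (arr[idx >> 1] & (0xFFFFu << ((idx & 0x1) << 4))) >> ((idx & 0x1) << 4);
}

template<typename T>
void write_int16(T& arr, int idx, unsigned int val) restrict(amp, cpu)
{
    // The first XOR clears the target 16 bits, the second one stores val there.
    atomic_fetch_xor(&arr[idx >> 1], arr[idx >> 1] & (0xFFFFu << ((idx & 0x1) << 4)));
    atomic_fetch_xor(&arr[idx >> 1], (val & 0xFFFFu) << ((idx & 0x1) << 4));
}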
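To see what the masking and shifting does, here is a single-threaded CPU sketch of the same packing idea in plain C++. The names `read_u16` and `write_u16` are made up for this illustration, and the sketch assumes element 0 lives in the low 16 bits of word 0 (which matches a reinterpret_cast of the pixel buffer on a little-endian machine). It uses a plain clear-then-set instead of atomics; on the GPU the atomic operations matter because two neighbouring threads update different halves of the same word.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical CPU counterpart of the 16-bit read helper.
// idx >> 1 picks the 32-bit word, the low bit of idx picks the half.
std::uint32_t read_u16(const std::vector<std::uint32_t>& words, std::size_t idx)
{
    assert((idx >> 1) < words.size());
    const unsigned shift = static_cast<unsigned>(idx & 0x1u) << 4;  // 0 or 16
    return (words[idx >> 1] >> shift) & 0xFFFFu;
}

// Hypothetical CPU counterpart of the 16-bit write helper.
// Not thread-safe: it is a plain read-modify-write of the shared word.
void write_u16(std::vector<std::uint32_t>& words, std::size_t idx, std::uint32_t val)
{
    assert((idx >> 1) < words.size());
    const unsigned shift = static_cast<unsigned>(idx & 0x1u) << 4;  // 0 or 16
    std::uint32_t& w = words[idx >> 1];
    w &= ~(0xFFFFu << shift);        // clear the target half
    w |= (val & 0xFFFFu) << shift;   // store the new value there
}
```

Writing element 1 leaves element 0 untouched, since only the upper half of the shared word is cleared and rewritten.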

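Note that these functions are written for 16-bit data, but it should not be too hard to adapt them to 8 bits. In fact, they are based on an 8-bit version; unfortunately, I could not find the reference.

I hope this helps.

David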
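As a rough idea of what the 8-bit adaptation involves, four bytes share one 32-bit word, so the word index becomes idx >> 2 and the bit offset becomes (idx & 0x3) << 3. The following is a single-threaded CPU sketch in plain C++ with made-up names (`read_u8`, `write_u8`), assuming byte 0 sits in the low 8 bits of word 0; a GPU version would again need atomic updates for the write.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical 8-bit read: idx >> 2 picks the word,
// idx & 0x3 picks one of the four bytes inside it.
std::uint32_t read_u8(const std::vector<std::uint32_t>& words, std::size_t idx)
{
    assert((idx >> 2) < words.size());
    const unsigned shift = static_cast<unsigned>(idx & 0x3u) << 3;  // 0, 8, 16 or 24
    return (words[idx >> 2] >> shift) & 0xFFu;
}

// Hypothetical 8-bit write (plain read-modify-write, not thread-safe).
void write_u8(std::vector<std::uint32_t>& words, std::size_t idx, std::uint32_t val)
{
    assert((idx >> 2) < words.size());
    const unsigned shift = static_cast<unsigned>(idx & 0x3u) << 3;  // 0, 8, 16 or 24
    std::uint32_t& w = words[idx >> 2];
    w &= ~(0xFFu << shift);        // clear the target byte
    w |= (val & 0xFFu) << shift;   // store the new value there
}
```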

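[Comments]:

Good to hear that there actually is an answer. I guess the 8-bit version you mention is the one described here: blogs.msdn.com/b/nativeconcurrency/archive/2012/01/17/…. I had been reading that document, but I was somewhat discouraged by the fact that all of these tricks are needed to make it work. Still, I have also learned that even OpenCL has its problems with images other than 32bpp, so I may give it a try after all.
Hi. I think you will like this link: stackoverflow.com/questions/23329231/…. I was able to get it working with textures (using Visual Studio 2013).
This will help you even more :) stackoverflow.com/questions/23376701/…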