Boost.Compute slower than plain CPU?

【Question title】: Boost.Compute slower than plain CPU?
【Posted】: 2014-08-08 10:41:54
【Question】:

I've just started playing with Boost.Compute to see how much of a speedup it can give us, and I wrote a simple program:

#include <iostream>
#include <vector>
#include <algorithm>
#include <numeric>   // std::accumulate
#include <cmath>     // std::sqrt
#include <cstdlib>   // rand
#include <boost/foreach.hpp>
#include <boost/compute/core.hpp>
#include <boost/compute/platform.hpp>
#include <boost/compute/algorithm.hpp>
#include <boost/compute/container/vector.hpp>
#include <boost/compute/functional/math.hpp>
#include <boost/compute/types/builtin.hpp>
#include <boost/compute/function.hpp>
#include <boost/chrono/include.hpp>

namespace compute = boost::compute;

int main()
{
    // generate random data on the host
    std::vector<float> host_vector(16000);
    std::generate(host_vector.begin(), host_vector.end(), rand);

    BOOST_FOREACH (auto const& platform, compute::system::platforms())
    {
        std::cout << "====================" << platform.name() << "====================\n";
        BOOST_FOREACH (auto const& device, platform.devices())
        {
            std::cout << "device: " << device.name() << std::endl;
            compute::context context(device);
            compute::command_queue queue(context, device);
            compute::vector<float> device_vector(host_vector.size(), context);

            // copy data from the host to the device
            compute::copy(
                host_vector.begin(), host_vector.end(), device_vector.begin(), queue
            );

            auto start = boost::chrono::high_resolution_clock::now();
            compute::transform(device_vector.begin(),
                       device_vector.end(),
                       device_vector.begin(),
                       compute::sqrt<float>(), queue);

            auto ans = compute::accumulate(device_vector.begin(), device_vector.end(), 0, queue);
            auto duration = boost::chrono::duration_cast<boost::chrono::milliseconds>(boost::chrono::high_resolution_clock::now() - start);
            std::cout << "ans: " << ans << std::endl;
            std::cout << "time: " << duration.count() << " ms" << std::endl;
            std::cout << "-------------------\n";
        }
    }
    std::cout << "====================plain====================\n";
    auto start = boost::chrono::high_resolution_clock::now();
    std::transform(host_vector.begin(),
                host_vector.end(),
                host_vector.begin(),
                [](float v){ return std::sqrt(v); });

    auto ans = std::accumulate(host_vector.begin(), host_vector.end(), 0);
    auto duration = boost::chrono::duration_cast<boost::chrono::milliseconds>(boost::chrono::high_resolution_clock::now() - start);
    std::cout << "ans: " << ans << std::endl;
    std::cout << "time: " << duration.count() << " ms" << std::endl;

    return 0;
}

And here is sample output on my machine (Win7 64-bit):

====================Intel(R) OpenCL====================
device: Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
ans: 1931421
time: 64 ms
-------------------
device: Intel(R) HD Graphics 4600
ans: 1931421
time: 64 ms
-------------------
====================NVIDIA CUDA====================
device: Quadro K600
ans: 1931421
time: 4 ms
-------------------
====================plain====================
ans: 1931421
time: 0 ms

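My question is: why is the plain (non-OpenCL) version faster?

【Comments】:

You could take a look at stackoverflow.com/questions/23901979/…
Without even reading the code: your sample is too small for a performance comparison...
@user1535111 Yes, I did read that one before posting.
@Jamboree So you don't think the gap comes from compiling the kernels?
@Jamboree It seems boost::compute caches compiled kernels, so you could call compute::transform and compute::accumulate once before the first timing.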

Tags: c++ boost opencl boost-compute

【Answer 1】:

I can see one probable reason for the big difference. Compare the CPU and GPU data flow:

CPU              GPU

                 copy data to GPU

                 set up compute code

calculate sqrt   calculate sqrt

sum              sum

                 copy data from GPU

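Given this, it would appear that the Intel chip is just a bit rubbish at general computation, and that the NVidia card is probably suffering from the extra data copying and from setting up the GPU to do the calculation.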

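You should try the same program with a much more complex operation: sqrt and sum are too simple to overcome the extra overhead of using the GPU. You could try calculating Mandelbrot points, for instance.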
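To give a sense of what a heavier per-element operation looks like, here is a plain C++ sketch of the Mandelbrot escape-time count for a single point. The function name and the iteration cap are chosen for illustration only; this is host code, not a Boost.Compute kernel, and it is only meant to show that each element costs a data-dependent loop rather than a single sqrt.

```cpp
// Illustrative helper (not from the original post): counts how many
// iterations of z = z*z + c it takes for |z| to exceed 2, capped at max_iter.
inline int mandelbrot_iterations(float cr, float ci, int max_iter) {
    float zr = 0.0f, zi = 0.0f;
    int i = 0;
    // |z| <= 2 is tested as |z|^2 <= 4 to avoid a square root per iteration.
    while (i < max_iter && zr * zr + zi * zi <= 4.0f) {
        float next_zr = zr * zr - zi * zi + cr;  // real part of z*z + c
        zi = 2.0f * zr * zi + ci;                // imaginary part of z*z + c
        zr = next_zr;
        ++i;
    }
    return i;
}
```

Points inside the set run the full max_iter iterations, so the work per element can be hundreds of multiply-adds, which is the kind of load that has a chance of amortizing transfer and launch overhead.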

In your example, moving the lambda into the accumulate would be faster (one pass over memory versus two).
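A minimal sketch of that one-pass idea for the host-side (plain CPU) code. The function name sum_of_sqrt is made up for illustration; it folds the square root into the binary operation given to std::accumulate, so the vector is read once and never written back.

```cpp
#include <cmath>
#include <numeric>
#include <vector>

// Illustrative helper: sums sqrt(v) over all elements in a single pass.
inline float sum_of_sqrt(const std::vector<float>& values) {
    // The 0.0f initial value keeps the accumulator in float; an int 0
    // would make std::accumulate truncate every partial sum to int.
    return std::accumulate(values.begin(), values.end(), 0.0f,
                           [](float acc, float v) { return acc + std::sqrt(v); });
}
```

Unlike the transform-then-accumulate version, this leaves the input vector unmodified, which may or may not be what the surrounding program wants.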

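【Comments】:

【Answer 2】:

You are getting bad results because you are measuring time incorrectly.

An OpenCL device has its own time counters, which are unrelated to the host counters. Every OpenCL task goes through 4 states whose timers can be queried (from the Khronos web site):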

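CL_PROFILING_COMMAND_QUEUED: when the command identified by the event is enqueued in a command queue by the host.
CL_PROFILING_COMMAND_SUBMIT: when the enqueued command identified by the event is submitted by the host to the device associated with the command queue.
CL_PROFILING_COMMAND_START: when the command identified by the event starts executing on the device.
CL_PROFILING_COMMAND_END: when the command identified by the event has finished executing on the device.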

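Note that these timers are device-side. So, to measure kernel and command-queue performance, you can query these timers. In your case, the last two timers are the ones you need.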
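The arithmetic on those four timestamps can be sketched as follows. The struct and function names are hypothetical; in a real program each value would be read with clGetEventProfilingInfo from an event on a queue created with profiling enabled, and the OpenCL specification reports them in nanoseconds. The sketch deliberately avoids any OpenCL calls so that it only shows which differences to take.

```cpp
#include <cstdint>

// Hypothetical holder for the four device-side timestamps, in nanoseconds.
struct ProfilingTimestamps {
    std::uint64_t queued;  // CL_PROFILING_COMMAND_QUEUED
    std::uint64_t submit;  // CL_PROFILING_COMMAND_SUBMIT
    std::uint64_t start;   // CL_PROFILING_COMMAND_START
    std::uint64_t end;     // CL_PROFILING_COMMAND_END
};

// Time the command actually spent executing on the device (END - START).
inline std::uint64_t execution_ns(const ProfilingTimestamps& t) {
    return t.end - t.start;
}

// Time between the host enqueueing the command and the device starting it
// (START - QUEUED), i.e. queueing and submission overhead.
inline std::uint64_t overhead_ns(const ProfilingTimestamps& t) {
    return t.start - t.queued;
}
```

Comparing execution_ns with overhead_ns for a small workload is one way to see whether the device is slow or whether the time is going into queue management.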

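In your sample code you are measuring host time, which includes the data transfer time (as Skizz said) as well as all the time spent on command-queue maintenance.

So, to find out the actual kernel performance, you need to either pass a cl_event to your kernel (I don't know how to do that in boost::compute) and query that event for the performance counters, or make your kernels huge and complicated enough to hide all the overheads.

【Comments】:

I did mean to measure host time, since I want to know how OpenCL performs compared with the normal solution. I think the device-side performance counters are better suited to comparing different algorithms written in OpenCL.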

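【Answer 3】:

As others have said, there is most likely not enough computation in your kernel to make it worthwhile to run on the GPU for a single set of data (you are being limited by kernel compilation time and by the transfer time to the GPU).

To get better performance numbers, you should run the algorithm multiple times (and most likely throw out the first run, since it will take much longer because it includes the time to compile and store the kernels).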
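A small host-side harness along these lines, as a sketch only. The names time_runs and mean_without_warmup are made up for illustration, and it uses std::chrono rather than boost::chrono. When timing OpenCL work, the callable should also wait for the device to finish (for example by calling finish() on the command queue) so that asynchronous commands are included in the measurement.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Runs `work` `runs` times and returns each run's wall-clock time in microseconds.
template <typename Work>
std::vector<double> time_runs(Work work, std::size_t runs) {
    std::vector<double> timings;
    timings.reserve(runs);
    for (std::size_t i = 0; i < runs; ++i) {
        auto start = std::chrono::steady_clock::now();
        work();
        auto stop = std::chrono::steady_clock::now();
        timings.push_back(
            std::chrono::duration<double, std::micro>(stop - start).count());
    }
    return timings;
}

// Mean of all runs except the first one, which is treated as a warm-up
// (for OpenCL it would include kernel compilation). Returns 0 if there
// are fewer than two runs.
inline double mean_without_warmup(const std::vector<double>& timings) {
    if (timings.size() < 2) return 0.0;
    double sum = 0.0;
    for (std::size_t i = 1; i < timings.size(); ++i) sum += timings[i];
    return sum / static_cast<double>(timings.size() - 1);
}
```

Reporting microseconds instead of milliseconds also avoids the "0 ms" reading that a 16000-element workload produces on the CPU.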

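Also, instead of running transform() and accumulate() as separate operations, you should use the fused transform_reduce() algorithm, which performs both the transform and the reduction with a single kernel. The code would look like this: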

float ans = 0;
compute::transform_reduce(
    device_vector.begin(),
    device_vector.end(),
    &ans,
    compute::sqrt<float>(),
    compute::plus<float>(),
    queue
);
std::cout << "ans: " << ans << std::endl;

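You can also compile code that uses Boost.Compute with -DBOOST_COMPUTE_USE_OFFLINE_CACHE, which enables the offline kernel cache (this requires linking with boost_filesystem). The kernels you use will then be stored on your file system and compiled only the first time you run the application (NVIDIA on Linux already does this by default).

【Comments】:

transform_reduce does perform better in this case. I also tried the equivalent accumulate with a custom function, but it was not as good as transform_reduce, and the results were somewhat different.
That is expected. For floating-point addition (which, unlike integer addition, is not associative), accumulate() will use a slower non-parallel code path.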
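Why the grouping of floating-point additions matters can be seen with three values in plain C++. This is only an illustration of the rounding effect, with made-up function names; it says nothing about how Boost.Compute orders its reduction internally.

```cpp
// Sums strictly left to right, as a sequential accumulate does: (a + b) + c.
inline float sum_left_to_right(float a, float b, float c) {
    float ab = a + b;
    return ab + c;
}

// Groups the last two operands first, as a parallel reduction might: a + (b + c).
inline float sum_right_grouped(float a, float b, float c) {
    float bc = b + c;
    return a + bc;
}
```

With IEEE 754 single precision and the inputs 1e8f, -1e8f and 1.0f, the first function returns 1 and the second returns 0, because -1e8f + 1.0f rounds back to -1e8f. A reduction that reorders additions can therefore give a slightly different sum from a sequential one.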