[Question title]: My GPU-accelerated OpenCV code is slower than normal OpenCV
[Posted]: 2020-05-24 06:53:23
[Question]:

I copied two examples from the book "Hands-On GPU-Accelerated Computer Vision with OpenCV and CUDA" to compare CPU and GPU performance.

First code:

    cv::Mat src = cv::imread("D:/Pics/Pen.jpg", 0); // Pen.jpg is a 4096 x 4096 grayscale picture.
    cv::Mat result_host1, result_host2, result_host3, result_host4, result_host5;

    // Get the initial tick count (ticks, not milliseconds)
    int64 work_begin = getTickCount();
    cv::threshold(src, result_host1, 128.0, 255.0, cv::THRESH_BINARY);
    cv::threshold(src, result_host2, 128.0, 255.0, cv::THRESH_BINARY_INV);
    cv::threshold(src, result_host3, 128.0, 255.0, cv::THRESH_TRUNC);
    cv::threshold(src, result_host4, 128.0, 255.0, cv::THRESH_TOZERO);
    cv::threshold(src, result_host5, 128.0, 255.0, cv::THRESH_TOZERO_INV);

    //Get time after work has finished     
    int64 delta = getTickCount() - work_begin;
    //Frequency of timer
    double freq = getTickFrequency();
    double work_fps = freq / delta;
    std::cout << "Performance of Thresholding on CPU: " << std::endl;
    std::cout << "Time: " << (1 / work_fps) << std::endl;
    std::cout << "FPS: " << work_fps << std::endl;
    return 0;

Second code:

    cv::Mat h_img1 = cv::imread("D:/Pics/Pen.jpg", 0);  // Pen.jpg is a 4096 x 4096 grayscale picture.
    cv::cuda::GpuMat d_result1, d_result2, d_result3, d_result4, d_result5, d_img1;
    //Measure initial time ticks
    int64 work_begin = getTickCount();
    d_img1.upload(h_img1);
    cv::cuda::threshold(d_img1, d_result1, 128.0, 255.0, cv::THRESH_BINARY);
    cv::cuda::threshold(d_img1, d_result2, 128.0, 255.0, cv::THRESH_BINARY_INV);
    cv::cuda::threshold(d_img1, d_result3, 128.0, 255.0, cv::THRESH_TRUNC);
    cv::cuda::threshold(d_img1, d_result4, 128.0, 255.0, cv::THRESH_TOZERO);
    cv::cuda::threshold(d_img1, d_result5, 128.0, 255.0, cv::THRESH_TOZERO_INV);

    cv::Mat h_result1, h_result2, h_result3, h_result4, h_result5;
    d_result1.download(h_result1);
    d_result2.download(h_result2);
    d_result3.download(h_result3);
    d_result4.download(h_result4);
    d_result5.download(h_result5);
    //Measure difference in time ticks
    int64 delta = getTickCount() - work_begin;
    double freq = getTickFrequency();
    //Measure frames per second
    double work_fps = freq / delta;
    std::cout << "Performance of Thresholding on GPU: " << std::endl;
    std::cout << "Time: " << (1 / work_fps) << std::endl;
    std::cout << "FPS: " << work_fps << std::endl;
    return 0;

Everything works, except:

"The GPU is slower than the CPU."

First result:

    Performance of Thresholding on CPU:
    Time: 0.0475497 
    FPS: 21.0306

Second result:

    Performance of Thresholding on GPU:
    Time: 0.599032
    FPS: 1.66936

Then I decided to exclude the upload and download time from the measurement:

Third code:

    cv::Mat h_img1 = cv::imread("D:/Pics/Pen.jpg", 0);  // Pen.jpg is a 4096 x 4096 grayscale picture.
    cv::cuda::GpuMat d_result1, d_result2, d_result3, d_result4, d_result5, d_img1;
    d_img1.upload(h_img1);
    //Measure initial time ticks
    int64 work_begin = getTickCount();
    cv::cuda::threshold(d_img1, d_result1, 128.0, 255.0, cv::THRESH_BINARY);
    cv::cuda::threshold(d_img1, d_result2, 128.0, 255.0, cv::THRESH_BINARY_INV);
    cv::cuda::threshold(d_img1, d_result3, 128.0, 255.0, cv::THRESH_TRUNC);
    cv::cuda::threshold(d_img1, d_result4, 128.0, 255.0, cv::THRESH_TOZERO);
    cv::cuda::threshold(d_img1, d_result5, 128.0, 255.0, cv::THRESH_TOZERO_INV);
    //Measure difference in time ticks
    int64 delta = getTickCount() - work_begin;
    double freq = getTickFrequency();
    //Measure frames per second
    double work_fps = freq / delta;
    std::cout << "Performance of Thresholding on GPU: " << std::endl;
    std::cout << "Time: " << (1 / work_fps) << std::endl;
    std::cout << "FPS: " << work_fps << std::endl;

    cv::Mat h_result1, h_result2, h_result3, h_result4, h_result5;
    d_result1.download(h_result1);
    d_result2.download(h_result2);
    d_result3.download(h_result3);
    d_result4.download(h_result4);
    d_result5.download(h_result5);
    return 0;

However, the problem persisted.

Third result:

    Performance of Thresholding on GPU:
    Time: 0.136095
    FPS: 7.34779

I am confused by this result. Summary of the three runs:

              1st         2nd         3rd
              CPU         GPU         GPU
    Time (s): 0.0475497   0.599032    0.136095
    FPS:      21.0306     1.66936     7.34779

Please help me.

GPU specs:

*********************************************************
NVIDIA Quadro K2100M

Micro architecture: Kepler

Compute capability version: 3.0

CUDA Version: 10.1
*********************************************************

My system specs:

*********************************************************
Laptop: HP ZBook

CPU: Intel(R) Core(TM) i7-4910MQ CPU @ 2.90GHz

RAM: 8.00 GB

OS: Windows 7, 64-bit, Ultimate, Service Pack 1
*********************************************************

[Discussion]:

    Tags: c++ opencv gpu


    [Solution 1]:

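    I can think of two reasons why the CPU version is faster even when the memory transfers are excluded:

    1. In the 2nd and 3rd code versions you declare the result GpuMats but never allocate them. They are allocated inside the threshold function through a call to GpuMat::create, so every run allocates 80 MB of GPU memory (five 4096 x 4096 8-bit images). You can see the performance gain by allocating the result GpuMats once and then reusing them. With the original 3rd code I get the following result (GeForce RTX 2080):

    Time: 0.010208 FPS: 97.9624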
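The 80 MB figure can be checked with a few lines of arithmetic. The sketch below is only an illustration: the helper names `imageBytes` and `totalBytes` are hypothetical, and it assumes tightly packed rows, whereas a real GpuMat allocation may pad each row, so the actual footprint can be slightly larger.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: bytes needed for one tightly packed image
// (assumption: no row padding, unlike a pitched GPU allocation).
constexpr std::size_t imageBytes(std::size_t width, std::size_t height,
                                 std::size_t channels, std::size_t bytesPerChannel) {
    return width * height * channels * bytesPerChannel;
}

// Hypothetical helper: bytes needed for `count` buffers of identical size.
constexpr std::size_t totalBytes(std::size_t count, std::size_t perImage) {
    return count * perImage;
}

constexpr std::size_t kMiB = 1024 * 1024;

// 4096 x 4096 pixels, one channel, one byte per pixel (CV_8UC1 in the post).
constexpr std::size_t kOneResult = imageBytes(4096, 4096, 1, 1);
// Five result buffers: d_result1 .. d_result5.
constexpr std::size_t kFiveResults = totalBytes(5, kOneResult);

// Checked at compile time: 16 MiB per buffer, 80 MiB for all five.
static_assert(kOneResult / kMiB == 16, "one result buffer is 16 MiB");
static_assert(kFiveResults / kMiB == 80, "five result buffers are 80 MiB");
```

The two `static_assert` lines make the compiler confirm the numbers, so the snippet needs no `main` to be useful.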

    When I change the code to:

    ...
    d_result1.create(h_img1.size(), CV_8UC1);
    d_result2.create(h_img1.size(), CV_8UC1);
    d_result3.create(h_img1.size(), CV_8UC1);
    d_result4.create(h_img1.size(), CV_8UC1);
    d_result5.create(h_img1.size(), CV_8UC1);
    d_img1.upload(h_img1);
    //Measure initial time ticks
    int64 work_begin = getTickCount();
    cv::cuda::threshold(d_img1, d_result1, 128.0, 255.0, cv::THRESH_BINARY);
    cv::cuda::threshold(d_img1, d_result2, 128.0, 255.0, cv::THRESH_BINARY_INV);
    cv::cuda::threshold(d_img1, d_result3, 128.0, 255.0, cv::THRESH_TRUNC);
    cv::cuda::threshold(d_img1, d_result4, 128.0, 255.0, cv::THRESH_TOZERO);
    cv::cuda::threshold(d_img1, d_result5, 128.0, 255.0, cv::THRESH_TOZERO_INV);
    ...
    

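    I get the following result (about 2x better): Time: 0.00503374 FPS: 198.659

    Preallocating the result GpuMats gives a major performance gain on the GPU, but the same modification does not help the CPU version.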
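The comparison above works by moving a one-time cost (allocation) out of the timed region. A minimal sketch of a timing helper built on the same idea follows. The name `averageSeconds` and its parameters are hypothetical, not part of OpenCV; it uses only `std::chrono`, runs a few untimed warm-up calls first, and then averages over several timed calls.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Hypothetical helper: average wall-clock seconds per call of `work`.
// The first `warmupRuns` calls are executed but not timed, so one-time
// costs such as the first buffer allocation are excluded from the result.
template <typename Work>
double averageSeconds(Work&& work, std::size_t warmupRuns, std::size_t timedRuns) {
    for (std::size_t i = 0; i < warmupRuns; ++i) {
        work();  // untimed warm-up call
    }

    const auto begin = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < timedRuns; ++i) {
        work();  // timed call
    }
    const auto end = std::chrono::steady_clock::now();

    const std::chrono::duration<double> elapsed = end - begin;
    // Avoid dividing by zero when no timed runs were requested.
    return timedRuns == 0 ? 0.0 : elapsed.count() / static_cast<double>(timedRuns);
}
```

In the setting of this post, `work` could be a lambda that runs the five `cv::cuda::threshold` calls on the already created `d_result1` .. `d_result5`; the same helper could wrap the five `cv::threshold` calls for the CPU measurement, so both numbers are taken the same way.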

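    2. The K2100M is not a very powerful GPU (576 cores @ 665 MHz). Considering that OpenCV may (depending on how it was compiled) use multithreading and SIMD instructions under the hood in the CPU version (2.90 GHz, 8 logical cores), the result is not surprising.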
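One way to put the reported timings on a common scale is to convert them into effective memory throughput. The sketch below assumes each threshold pass reads the input image once and writes one output image of the same size, with tightly packed 8-bit pixels; the helper names `trafficMiB` and `throughputMiBps` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: MiB read plus MiB written by `calls` threshold
// passes over a single-channel 8-bit image (assumption: one full read and
// one full write per pass, no row padding).
constexpr double trafficMiB(std::size_t width, std::size_t height, std::size_t calls) {
    return 2.0 * (static_cast<double>(width * height) / (1024.0 * 1024.0))
               * static_cast<double>(calls);
}

// Hypothetical helper: effective throughput in MiB/s for a measured time.
constexpr double throughputMiBps(double mib, double seconds) {
    return mib / seconds;
}
```

For the 4096 x 4096 image and five passes this gives 160 MiB of traffic. Dividing by the times quoted in this post yields roughly 3.3 GiB/s for the CPU run, about 1.1 GiB/s for the third K2100M run, and about 31 GiB/s for the preallocated RTX 2080 run. These are effective figures only: any allocation that happened inside the timed region is included in them.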

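    Edit: Profiling the application with NVIDIA Nsight Systems gives a better picture of the cost of the GPU memory operations:

    The profile shows that allocating and freeing the memory alone takes 10.5 ms, while the thresholding itself takes only 5 ms.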
