Using Thrust with OpenMP: no substantial speed-up obtained

【Question title】: Using thrust with openmp: no substantial speed up obtained
【Posted】: 2014-12-13 11:26:04
【Question】:

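I am interested in porting code that I wrote mostly with the Thrust GPU library to multicore CPUs. Thankfully, the website says that Thrust code can be used with threading environments such as OpenMP / Intel TBB.

I wrote the simple code below, which sorts a large array, to see what speed-up I get on a machine that supports up to 16 OpenMP threads.

The timings obtained on this machine for sorting a random array of 16 million elements are:

STL: 1.47 s
Thrust (16 threads): 1.21 s

There seems to be barely any speed-up. I would like to know how to get a substantial speed-up for sorting arrays with OpenMP, as I do with GPUs.

The code is below (file sort.cu). It was compiled as follows: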

nvcc -O2 -o sort sort.cu -Xcompiler -fopenmp -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_BACKEND_OMP -lgomp

The NVCC version is 5.5. The Thrust library version is v1.7.0.

#include <iostream>
#include <iomanip>
#include <cmath>
#include <cstdlib>
#include <stdio.h>
#include <algorithm>    
#include <ctime>
#include <time.h>
#include "thrust/sort.h"    

int main(int argc, char *argv[])
{
  int N = 16000000;
  double* myarr = new double[N];

  for (int i = 0; i < N; ++i)
    {
      myarr[i] = (1.0*rand())/RAND_MAX; 
     }
  std::cout << "-------------\n";

  clock_t start,stop;
  start=clock();
  std::sort(myarr,myarr+N);
  stop=clock();

  std::cout << "Time taken for sorting the array with STL  is " << (stop-start)/(double)CLOCKS_PER_SEC;

  //--------------------------------------------

  srand(1);
  for (int i = 0; i < N; ++i)
    {
      myarr[i] = (1.0*rand())/RAND_MAX; 
      //std::cout << myarr[i] << std::endl;
     }

  start=clock();
  thrust::sort(myarr,myarr+N);
  stop=clock();

  std::cout << "------------------\n";


  std::cout << "Time taken for sorting the array with Thrust  is " << (stop-start)/(double)CLOCKS_PER_SEC;
  return 0;
}

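【Comments】:

Don't use clock(); use omp_get_wtime(). Sorting is something I have been meaning to look into, but since it is O(n log n), my guess is that the operation is memory-bandwidth bound and so cannot benefit much from multiple fast cores. It is different on a GPU (or Xeon Phi), because the ratio between "core" speed and memory speed is much lower there.

Tags: multithreading sorting thrust

【Solution 1】: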

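The device backend refers to the behavior of operations performed on a thrust::device_vector or a similar reference. Thrust interprets the array/pointer you are passing as a host pointer and performs host-based operations on it, which are not affected by the device backend setting.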

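There are a variety of ways to fix this. If you read the device backend documentation, you will find general examples and OMP-specific examples. I think you could even specify a different host backend, which should give the desired behavior (OMP usage) with your code as written.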

Once you fix this, you may get an additional surprise: Thrust appears to sort the array quickly, but reports a very long execution time. I believe this is because the clock() function is affected by the number of OMP threads in use.
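The effect can be seen in isolation with a small sketch that times the same parallel loop two ways. It assumes a Linux-like system, where clock() reports CPU time summed over all threads of the process; the `Timing` struct and `time_parallel_work` function are names made up for this illustration. When built with -fopenmp the CPU time should come out at roughly the wall time multiplied by the number of busy threads; without that flag the pragma is ignored and the two numbers should be close.

```cpp
#include <chrono>
#include <cstddef>
#include <ctime>
#include <vector>

// Illustrative result type: the same interval measured two ways.
struct Timing {
    double cpu_seconds;   // from clock(): CPU time of the whole process
    double wall_seconds;  // from steady_clock: elapsed real time
    double checksum;      // keeps the computed values observable
};

// Runs CPU-bound work over n elements and times it with both clocks.
Timing time_parallel_work(std::size_t n)
{
    std::vector<double> v(n, 1.0);
    const long len = static_cast<long>(n);

    const std::clock_t c0 = std::clock();
    const auto w0 = std::chrono::steady_clock::now();

    #pragma omp parallel for
    for (long i = 0; i < len; ++i) {
        double x = v[i];
        for (int k = 0; k < 200; ++k)
            x = x * 1.0000001 + 1e-9;  // arbitrary arithmetic to burn cycles
        v[i] = x;
    }

    const auto w1 = std::chrono::steady_clock::now();
    const std::clock_t c1 = std::clock();

    Timing t;
    t.cpu_seconds = static_cast<double>(c1 - c0) / CLOCKS_PER_SEC;
    t.wall_seconds = std::chrono::duration<double>(w1 - w0).count();
    t.checksum = v.empty() ? 0.0 : v[0];
    return t;
}
```

This is why the sample run below switches to a wall-clock timer: for a benchmark of a multithreaded sort, elapsed real time is the number of interest.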

The following code and sample run address both issues, and give me a speed-up of about 3x with 4 threads.

$ cat t592.cu
#include <iostream>
#include <iomanip>
#include <cmath>
#include <cstdlib>
#include <stdio.h>
#include <algorithm>
#include <ctime>
#include <sys/time.h>
#include <time.h>
#include <thrust/device_ptr.h>
#include <thrust/sort.h>

int main(int argc, char *argv[])
{
  int N = 16000000;
  double* myarr = new double[N];

  for (int i = 0; i < N; ++i)
    {
      myarr[i] = (1.0*rand())/RAND_MAX;
     }
  std::cout << "-------------\n";

  timeval t1, t2;
  gettimeofday(&t1, NULL);
  std::sort(myarr,myarr+N);
  gettimeofday(&t2, NULL);
  float et = (((t2.tv_sec*1000000)+t2.tv_usec)-((t1.tv_sec*1000000)+t1.tv_usec))/float(1000000);

  std::cout << "Time taken for sorting the array with STL  is " << et << std::endl;

  //--------------------------------------------

  srand(1);
  for (int i = 0; i < N; ++i)
    {
      myarr[i] = (1.0*rand())/RAND_MAX;
      //std::cout << myarr[i] << std::endl;
     }
  thrust::device_ptr<double> darr = thrust::device_pointer_cast<double>(myarr);
  gettimeofday(&t1, NULL);
  thrust::sort(darr,darr+N);
  gettimeofday(&t2, NULL);
  et = (((t2.tv_sec*1000000)+t2.tv_usec)-((t1.tv_sec*1000000)+t1.tv_usec))/float(1000000);

  std::cout << "------------------\n";


  std::cout << "Time taken for sorting the array with Thrust  is " << et << std::endl;
  return 0;
}

$ nvcc -O2 -o t592 t592.cu -Xcompiler -fopenmp -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_BACKEND_OMP -lgomp
$ OMP_NUM_THREADS=4 ./t592
-------------
Time taken for sorting the array with STL  is 1.31956
------------------
Time taken for sorting the array with Thrust  is 0.468176
$

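Your mileage may vary. In particular, you may not see any improvement as you go above 4 threads. A number of factors can prevent an OMP code from scaling beyond a certain number of threads. Sorting generally tends to be a memory-bound algorithm, so you will probably see performance improve until the memory subsystem is saturated, and then no further gain from additional cores. Depending on your system, you may already be in that situation, in which case you may not see any improvement from OMP-style multithreading.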
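One rough way to check whether memory bandwidth is the limit on a given machine is to time a simple streaming loop at different thread counts and see where the throughput stops growing. The sketch below is only an approximation in the spirit of a STREAM triad, not a calibrated benchmark; the function name `triad_bandwidth_gbs` and its parameters are invented for this example, and the arrays are initialised serially, which can skew results on multi-socket systems.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Illustrative helper: returns the best observed bandwidth, in GB/s, of
// the loop a[i] = b[i] + 3.0 * c[i] over arrays of n doubles.
double triad_bandwidth_gbs(std::size_t n, int repeats = 5)
{
    if (n == 0) return 0.0;

    std::vector<double> a(n, 0.0), b(n, 1.0), c(n, 2.0);
    const long len = static_cast<long>(n);
    // Two reads and one write of a double per element.
    const double bytes = 3.0 * sizeof(double) * static_cast<double>(n);

    double best = 0.0;
    for (int r = 0; r < repeats; ++r) {
        const auto t0 = std::chrono::steady_clock::now();

        #pragma omp parallel for
        for (long i = 0; i < len; ++i)
            a[i] = b[i] + 3.0 * c[i];

        const auto t1 = std::chrono::steady_clock::now();
        const double s = std::chrono::duration<double>(t1 - t0).count();
        if (s > 0.0 && bytes / s / 1e9 > best)
            best = bytes / s / 1e9;
    }

    // Read one result so the stores above cannot be treated as dead.
    volatile double sink = a[n / 2];
    (void)sink;
    return best;
}
```

Built with -fopenmp and run with OMP_NUM_THREADS set to 1, 2, 4, 8 and so on (with n large enough that the arrays do not fit in cache), the thread count at which the reported figure flattens gives a hint of where a memory-bound sort is likely to stop scaling too.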

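【Discussion】:

Isn't sorting an O(n log n) operation, so it should be memory-bandwidth bound and not benefit much from multiple fast cores (on a single-socket system)? But I guess your result shows I am wrong...
Actually, memory bandwidth is the most likely reason that performance improves up to about 4 cores and then plateaus. A modern Intel Xeon cannot saturate the memory bus with requests from a single core. To use the full memory bandwidth available on a single Xeon socket, requests have to be issued from multiple cores. But about 4 cores should be enough to saturate the bus attached to a single socket. My processor in this case is an Intel Ivy Bridge Xeon.
I agree that multiple threads are needed to maximize bandwidth even on a single-socket system; I found this out in measuring-memory-bandwidth-from-the-dot-product-of-two-arrays. But in my experience multiple threads increased the bandwidth by less than a factor of two, whereas you got a factor of three, which is more than I expected.
How many real cores does your Ivy Bridge Xeon have? What exactly is the processor?
It is a Xeon E5-2697 v2, with 12 cores. On my system, a single-threaded comparison between Thrust and STL gives about 1.32 s for STL and 0.88 s for Thrust (OP got a different ratio). So in reality I am only getting about a 2x benefit from multithreading itself, which is probably consistent with the increase in available memory bandwidth.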
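
Putting the numbers quoted in this thread together, the overall speed-up splits into an algorithmic part (single-threaded Thrust versus STL) and a threading part (4 threads versus 1 thread), using the rounded timings 1.32 s, 0.88 s and 0.468 s:

```latex
\begin{align*}
S_{\text{total}}     &= \frac{T_{\text{STL}}}{T_{\text{Thrust},4}} = \frac{1.32}{0.468} \approx 2.8 \\
S_{\text{algorithm}} &= \frac{T_{\text{STL}}}{T_{\text{Thrust},1}} = \frac{1.32}{0.88} = 1.5 \\
S_{\text{threads}}   &= \frac{T_{\text{Thrust},1}}{T_{\text{Thrust},4}} = \frac{0.88}{0.468} \approx 1.9 \\
S_{\text{total}}     &= S_{\text{algorithm}} \times S_{\text{threads}} \approx 1.5 \times 1.9 \approx 2.8
\end{align*}
```

So the "about 3x" figure is roughly 1.5x from the different sort implementation and a little under 2x from the four threads.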