Monte Carlo simulation runs significantly slower than sequential

【Question title】: Monte Carlo simulation runs significantly slower than sequential
【Posted】: 2021-04-10 03:00:41
【Question】:

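I am familiar with the general concepts of concurrent and parallel programming. I am trying to calculate Pi in C using the Monte Carlo method. Here is my source code: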

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>

int main(void)
{
    long points;
    long m = 0;
    double coordinates[2];
    double distance;
    printf("Enter the number of points: ");
    scanf("%ld", &points);

    srand((unsigned long) time(NULL));
    for(long i = 0; i < points; i++)
    {
        coordinates[0] = ((double) rand() / (RAND_MAX));
        coordinates[1] = ((double) rand() / (RAND_MAX));
        distance = sqrt(pow(coordinates[0], 2) + pow(coordinates[1], 2));
        if(distance <= 1)
            m++;
    }

    printf("Pi is roughly %lf\n", (double) 4*m / (double) points);
}

When I try to make this program run in parallel using the OpenMP API, it runs almost 4 times slower.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>
#include <omp.h>
#include <sys/sysinfo.h>

int main(void)
{

    long total_points;              // Total number of random points which is given by the user
    volatile long total_m = 0;      // Total number of random points which are inside of the circle
    int threads = get_nprocs();     // This is needed so each thread knows how many random points it should generate
    printf("Enter the number of points: ");
    scanf("%ld", &total_points);
    omp_set_num_threads(threads);   

    #pragma omp parallel
    {
       double coordinates[2];          // Contains the x and y of each random point
       long m = 0;                     // Number of points that are in the circle for any particular thread
       long points = total_points / threads;   // Number of random points that each thread should generate
       double distance;                // Distance of the random point from the center of the circle, if greater than 1 then the point is outside of the circle
       srand((unsigned long) time(NULL));

        for(long i = 0; i < points; i++)
        {
           coordinates[0] = ((double) rand() / (RAND_MAX));    // Random x
           coordinates[1] = ((double) rand() / (RAND_MAX));    // Random y
           distance = sqrt(pow(coordinates[0], 2) + pow(coordinates[1], 2));   // Calculate the distance
          if(distance <= 1)
              m++;
       }

       #pragma omp critical
       {
           total_m += m;
       }
    }

    printf("Pi is roughly %lf\n", (double) 4*total_m / (double) total_points);
}

I tried looking up the reason, but I found different answers for different algorithms.

【Question comments】:

Tags: c++ c multithreading parallel-processing openmp

【Solution 1】:

Your code has two sources of overhead: the critical region and the calls to rand(). Instead of rand(), use rand_r:

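I think you are looking for rand_r(), which explicitly takes the current RNG state as a parameter. Each thread should then have its own copy of the seed data (whether you want each thread to start from the same seed or from different ones depends on what you are doing; here you want them to be different, otherwise you would get the same sequence again and again).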
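To make the per-thread state idea concrete, here is a minimal sketch. The helper names unit_random and seed_for_thread are hypothetical (they are not part of the answer), and the seeding rule is only one simple way to give each thread a different starting state.

```c
#define _POSIX_C_SOURCE 200809L  /* exposes rand_r() under strict -std=c99 */
#include <stdlib.h>

/* Draw a value in [0, 1] from a generator state owned by the caller.
 * No global state is touched, so threads that hold separate states
 * do not interfere with each other. */
double unit_random(unsigned int *state)
{
    return (double) rand_r(state) / RAND_MAX;
}

/* Hypothetical seeding rule: offset a base seed by the thread index
 * so that every thread starts from a different state. */
unsigned int seed_for_thread(unsigned int base_seed, int thread_index)
{
    return base_seed + (unsigned int) thread_index;
}
```

Inside a #pragma omp parallel region, each thread would declare its own variable, for example unsigned int state = seed_for_thread(base, omp_get_thread_num());, and pass &state to unit_random.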

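The critical region can be removed by using the OpenMP reduction clause. Moreover, you do not need to call sqrt (comparing the squared distance against 1 gives the same answer), nor do you need to divide the points among the threads manually (i.e., long points = total_points / threads;); you can use #pragma omp for instead. Your code would then look like this: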

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>
#include <omp.h>
#include <sys/sysinfo.h>

int main(void)
{
    long total_points; 
    long total_m = 0;
    int threads = get_nprocs();   
    printf("Enter the number of points: ");
    scanf("%ld", &total_points);
    omp_set_num_threads(threads);   

    #pragma omp parallel 
    {                  
        unsigned int myseed = omp_get_thread_num();
        #pragma omp for reduction (+: total_m)
        for(long i = 0; i < total_points; i++){
            if(pow((double) rand_r(&myseed) / (RAND_MAX), 2) + pow((double) rand_r(&myseed) / (RAND_MAX), 2) <= 1)
               total_m++;
         }
     }
    printf("Pi is roughly %lf\n", (double) 4*total_m / (double) total_points);

}
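The sqrt-free test can also be isolated into a small kernel that each thread runs on its own generator state. This is only a sketch: inside_unit_circle and count_hits are hypothetical names, and it multiplies x * x directly instead of calling pow, which is a substitution of mine rather than something the code above does.

```c
#define _POSIX_C_SOURCE 200809L  /* exposes rand_r() under strict -std=c99 */
#include <stdlib.h>

/* Returns 1 when (x, y) lies inside the unit circle. Comparing the
 * squared distance with 1 avoids the sqrt call entirely. */
int inside_unit_circle(double x, double y)
{
    return x * x + y * y <= 1.0;
}

/* Counts how many of `points` random samples fall inside the circle,
 * drawing from the caller-owned generator state. */
long count_hits(long points, unsigned int *state)
{
    long hits = 0;
    for (long i = 0; i < points; i++) {
        double x = (double) rand_r(state) / RAND_MAX;
        double y = (double) rand_r(state) / RAND_MAX;
        hits += inside_unit_circle(x, y);
    }
    return hits;
}
```

The estimate is then 4.0 * hits / points, summed over all threads if several of them call count_hits with their own state.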

A quick test on my machine with an input of 1000000000:

sequential : 16.282835 seconds 
2 threads  :  8.206498 seconds  (1.98x faster)
4 threads  :  4.107366 seconds  (3.96x faster)
8 threads  :  2.728513 seconds  (5.96x faster)

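Bear in mind that my machine has only 4 cores. Nonetheless, for a more meaningful comparison, one should optimize the sequential code as much as possible and only then compare it with the parallel versions. Naturally, if the sequential version is optimized as much as possible, the speedup of the parallel version may drop. For example, testing the current parallel version, unmodified, against the sequential version of the code provided by @user3666197 yields the following results: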

sequential :  9.343118 seconds 
2 threads  :  8.206498 seconds  (1.13x faster)
4 threads  :  4.107366 seconds  (2.27x faster)
8 threads  :  2.728513 seconds  (3.42x faster)
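The "x faster" figures are the sequential time divided by the parallel time, so the same parallel run looks better or worse depending on the baseline. A small sketch of that arithmetic follows; the function names are hypothetical, and efficiency (speedup divided by thread count) is a standard companion metric that the answer itself does not report.

```c
/* Speedup: how many times faster the parallel run is than the baseline. */
double speedup(double sequential_seconds, double parallel_seconds)
{
    return sequential_seconds / parallel_seconds;
}

/* Parallel efficiency: speedup per thread; 1.0 would be ideal scaling. */
double efficiency(double sequential_seconds, double parallel_seconds,
                  int threads)
{
    return speedup(sequential_seconds, parallel_seconds) / threads;
}
```

With the timings above, speedup(16.282835, 4.107366) is roughly 3.96, while speedup(9.343118, 4.107366) is roughly 2.27 for the very same 4-thread run.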

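However, one can also improve the parallel version, and so on and so forth. For example, taking @user3666197's version, fixing the race condition on the update of coordinates (which was shared among threads), and adding the OpenMP #pragma omp for, we get the following code: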

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#include <sys/sysinfo.h>

int main(void)
{
    double start = omp_get_wtime();
    long points = 1000000000; //....................................... INPUT AVOIDED
    long m = 0;
    unsigned long HAUSNUMERO = 1;
    double DIV1byMAXbyMAX = 1. / RAND_MAX / RAND_MAX;

    int threads = get_nprocs();
    omp_set_num_threads(threads);
    #pragma omp parallel reduction (+: m )
    {
        unsigned int aThreadSpecificSEED_x = HAUSNUMERO + 1 + omp_get_thread_num();
        unsigned int aThreadSpecificSEED_y = HAUSNUMERO - 1 + omp_get_thread_num();
        #pragma omp for nowait
        for(long i = 0; i < points; i++)
        {
            double x = rand_r( &aThreadSpecificSEED_x );
            double y = rand_r( &aThreadSpecificSEED_y );
            m += (1  >= ( x * x + y * y ) * DIV1byMAXbyMAX);
        }
    }
    double end = omp_get_wtime();
    printf("%f\n",end-start);
    printf("Pi is roughly %lf\n", (double) 4*m / (double) points);
}

which yields the following results:

sequential :  9.160571 seconds 
2 threads  :  4.769141 seconds  (1.92 x faster)
4 threads  :  2.456783 seconds  (3.72 x faster)
8 threads  :  2.203758 seconds  (4.15 x faster)

I am compiling with the flags -O3 -std=c99 -fopenmp, using gcc version 4.9.3 (MacPorts gcc49 4.9.3_0).

【Comments】:

【Solution 2】:

Adding a few cents beyond the Amdahl's Law argument

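With such an extremely trivial amount of "useful" work inside the loop, AVX-512 register-level parallelism and SIMD-alignment tricks will most probably outperform any heavyweight OpenMP setup for points << 1E15+.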

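This answer is offered as inspiration on where else the code can save substantial costs, thanks to an analytically equivalent formulation of the problem (avoiding the expensive SQRT-s and DIV-s, which add no value).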

The code is available in the Godbolt.org IDE for any further online experimentation and profiling.

The simplified code was modified in the Godbolt.org IDE for further re-testing.

I suggest leaving the timing part to @dreamcrash, who has a level playing field for re-testing and for making a meaningful comparison:

#include <stdio.h> //............................. -O3 -fopenmp
#include <stdlib.h>
#include <math.h>
#include <time.h>
#include <omp.h>
#include <sys/sysinfo.h>

int main(void)
{
    long points = 1000; //....................................... INPUT AVOIDED
    long m = 0;
//  double coordinates[2]; //.................................... OBVIOUS TO BE PUT IN PRIVATE PART
    unsigned long HAUSNUMERO = 1; //............................. AVOID SIN OF IRREPRODUCIBILITY
//  printf( "RAND_MAX is %ld on this platform\n", RAND_MAX );//.. 2147483647 PLATFORM SPECIFIC
    double DIV1byMAXbyMAX = 1. / RAND_MAX / RAND_MAX; //......... PRECOMPUTE A STATIC VALUE

    int threads = get_nprocs();
    omp_set_num_threads(threads);

    #pragma omp parallel reduction (+: m )
    {
    //..............................SEED.x PRINCIPALLY INDEPENDENT FOR MUTUALLY RANDOM SEQ-[x,y]
        unsigned int aThreadSpecificSEED_x = HAUSNUMERO + 1 + omp_get_thread_num();
        unsigned int aThreadSpecificSEED_y = HAUSNUMERO - 1 + omp_get_thread_num();
    //..............................SEED.y PRINCIPALLY INDEPENDENT FOR MUTUALLY RANDOM SEQ-[x,y]
        double x, y;

        for(long i = 0; i < points / threads; i++)
        {   
            x = rand_r( &aThreadSpecificSEED_x );
            y = rand_r( &aThreadSpecificSEED_y );

            if( 1  >= ( x * x //................. NO INTERIM STORAGE NEEDED
                      + y * y //................. NO SQRT EVER NEEDED
                        ) * DIV1byMAXbyMAX //.... MUL is WAY FASTER THAN DIV
                   )
            m++;
        }
    }
    printf("Pi is roughly %lf\n", (double) 4*m / (double) points);
}

【Comments】:

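By the way, there is a race condition in your code, namely coordinates[0] = rand_r( &aThreadSpecificSEED_x ); and coordinates[1] = rand_r( &aThreadSpecificSEED_y );, because double coordinates[2]; is shared among threads.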
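@dreamcrash Glad to have helped you shave more than half a second off the HPC-grade (albeit purely synthetic) workload. This is very much in line with the Amdahl's Law argument about diminishing returns, which are bounded by the pure-[SERIAL] part of the process and are the easier part for students and non-HPC users to grasp. Isn't HPC about efficiency and the highest achievable processing performance? The rest is elementary. The text reviewing the distortions behind so-called "super-linear" speedups has obvious value, yet such an argument cannot hold (oranges to apples). PF2021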
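Thanks for your help. I have never disagreed on that point; in fact, I have had similar discussions with people claiming 400x speedups from GPUs, where most of the claimed speedup came from suboptimal sequential code. My point was that this is beyond the scope of this answer; however, I should have made it clear in my answer from the start that the results could be misleading.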

【Solution 3】:

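The problem you are experiencing is inherent to the use of the function rand(), which is not required to be reentrant. As a result, when multiple threads enter this function, they contend with each other to read and write its data in a way that is not thread-safe. This contention leads to extremely slow behavior. Instead of rand(), look for a similar function that is reentrant.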

【Comments】:

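Sorry, the hypothesis above may be relevant to eventual reproducibility or cryptographic testing of the generated sequences, but in the context of the posted (extremely nano-scale) problem it misdirects attention. The core of the surprise has to do with the add-on overhead costs of all the OpenMP parallel-process setup, communication, and termination, compared with the tiny scale of the "useful" work performed in very few loop iterations. Test the same (extremely inefficient) code for 1E{6,9,12,...}+ loops and see those costs addressed - stackoverflow.com/a/65224536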
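The overhead argument in this comment can be illustrated with a deliberately simplified model: the parallel run time is a fixed setup cost plus the work divided evenly among threads. This is an assumption made for illustration, not a measurement, and the function names and any numbers plugged into them are hypothetical.

```c
/* Simplified model (assumption): a fixed overhead for creating and
 * joining threads, plus the useful work split evenly among threads. */
double modeled_parallel_seconds(double overhead_seconds,
                                double work_seconds, int threads)
{
    return overhead_seconds + work_seconds / threads;
}

/* Speedup relative to doing all the work sequentially with no overhead. */
double modeled_speedup(double overhead_seconds, double work_seconds,
                       int threads)
{
    return work_seconds /
           modeled_parallel_seconds(overhead_seconds, work_seconds, threads);
}
```

Under this model, a loop whose useful work is much smaller than the overhead yields a speedup below 1 (the parallel run is slower), while a very long loop approaches the ideal speedup equal to the thread count.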

【Solution 4】:

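You need to replace rand() with a thread-specific random number generator that accesses only local variables. Otherwise the threads will compete to synchronize the same cache line.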
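One possible shape for such a generator is sketched below, using a xorshift64 step whose entire state lives in a local variable. The choice of xorshift64 is mine, not this answer's, and the function names are hypothetical; any generator with caller-owned state would serve the same purpose.

```c
#include <stdint.h>

/* One xorshift64 step (shift triple 13, 7, 17). The state must be
 * nonzero and is owned entirely by the caller. */
uint64_t xorshift64_next(uint64_t *state)
{
    uint64_t x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    *state = x;
    return x;
}

/* Map the top 53 bits of the next output to a double in [0, 1). */
double xorshift64_unit(uint64_t *state)
{
    return (double) (xorshift64_next(state) >> 11)
           * (1.0 / 9007199254740992.0);   /* divide by 2^53 */
}

/* Count hits using a generator state that never leaves this function,
 * so no other thread can touch its cache line. */
long count_hits_local(long points, uint64_t seed)
{
    uint64_t state = seed ? seed : 1;      /* avoid the all-zero state */
    long hits = 0;
    for (long i = 0; i < points; i++) {
        double x = xorshift64_unit(&state);
        double y = xorshift64_unit(&state);
        hits += (x * x + y * y <= 1.0);
    }
    return hits;
}
```

Each thread would call count_hits_local with its own seed (for example, a base value plus its thread number) and the results would be summed with a reduction.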

【Comments】: